Chemical methods for nucleic acid-based data storage

ABSTRACT

The present disclosure provides methods and systems for encoding digital information in nucleic acid (e.g., deoxyribonucleic acid) molecules without base-by-base synthesis, by encoding bit-value information in the presence or absence of unique nucleic acid sequences within a pool, comprising specifying each bit location in a bit-stream with a unique nucleic acid sequence and specifying the bit value at that location by the presence or absence of the corresponding unique nucleic acid sequence in the pool. Also disclosed are chemical methods for generating unique nucleic acid sequences using combinatorial genomic strategies (e.g., assembly of multiple nucleic acid sequences or enzymatic-based editing of nucleic acid sequences).

CROSS-REFERENCE

This application is a continuation of U.S. patent application Ser. No. 17/012,909, filed on Sep. 4, 2020, which is a continuation of International Patent Application No. PCT/US2019/022596, filed on Mar. 15, 2019, which claims priority to U.S. Provisional Patent Application No. 62/644,323, filed on Mar. 16, 2018, each of which is entirely incorporated herein by reference.

SEQUENCE LISTING

The instant application contains a Sequence Listing, which has been submitted via EFS Web and is hereby incorporated by reference in its entirety. Said ASCII copy, created on Feb. 17, 2022, is named 1885610-0002-004-302_Seq_Listing.txt and is 1,074 bytes in size.

BACKGROUND

Nucleic acid digital data storage is a stable approach for encoding and storing information for long periods of time, with data stored at higher densities than magnetic tape or hard drive storage systems. Additionally, digital data stored in nucleic acid molecules that are stored in cold and dry conditions can be retrieved as long as 60,000 years later or longer.

To access digital data stored in nucleic acid molecules, the nucleic acid molecules may be sequenced. As such, nucleic acid digital data storage may be an ideal method for storing data that is not frequently accessed but may have a high volume of information to be stored or archived for long periods of time.

Current methods rely on encoding the digital information (e.g., binary code) into base-by-base nucleic acid sequences, such that the base-to-base relationship in the sequence directly translates into the digital information (e.g., binary code). Sequencing of digital data stored in base-by-base sequences that can be read into bit-streams or bytes of digitally encoded information can be error prone, and such data can be costly to encode because de novo base-by-base nucleic acid synthesis can be expensive. New methods of performing nucleic acid digital data storage may provide approaches for encoding and retrieving data that are less costly and easier to commercially implement.

SUMMARY

Provided herein are methods and systems for encoding digital information in nucleic acid (e.g., deoxyribonucleic acid, DNA) molecules without base-by-base synthesis, by encoding bit-value information in the presence or absence of unique nucleic acid sequences within a pool, comprising specifying each bit location in a bit-stream with a unique nucleic acid sequence and specifying the bit value at that location by the presence or absence of the corresponding unique nucleic acid sequence in the pool. More generally, unique bytes in a byte stream may be specified by unique subsets of nucleic acid sequences. Also disclosed are methods for generating unique nucleic acid sequences without base-by-base synthesis using combinatorial genomic strategies (e.g., assembly of multiple nucleic acid sequences or enzymatic-based editing of nucleic acid sequences).

In an aspect, the present disclosure provides a method for writing information into a nucleic acid sequence, comprising: (a) generating a string of symbols to represent the information; (b) constructing a plurality of components, wherein each individual component of the plurality of components comprises a nucleic acid sequence; (c) generating at least one sticky end of the individual component of the plurality of components; (d) chemically linking together two or more components of the plurality of components via the at least one sticky end of the individual component of the two or more components, thereby generating a plurality of identifiers, wherein each identifier of the plurality of identifiers comprises two or more components, wherein an individual identifier of the plurality of identifiers corresponds to an individual symbol in the string of symbols; and (e) selectively capturing or amplifying an identifier library comprising at least a subset of the plurality of identifiers.

In some embodiments, each symbol of the string of symbols is one of one or more possible symbol values. In some embodiments, each symbol in the string of symbols is one of two possible symbol values. In some embodiments, one symbol value at each position of the string of symbols may be represented by the absence of a distinct identifier in the identifier library. In some embodiments, the two possible symbol values are a bit-value of 0 and 1, wherein the individual symbol with the bit-value of 0 in the string of symbols may be represented by an absence of a distinct identifier in the identifier library, wherein the individual symbol with the bit-value of 1 in the string of symbols may be represented by a presence of the distinct identifier in the identifier library, or vice versa. In some embodiments, (d) comprises chemically linking the two or more components from two or more layers and wherein each layer of the two or more layers comprises a distinct set of components. In some embodiments, the individual identifier from the identifier library comprises one component from each layer of the two or more layers. In some embodiments, the two or more components are assembled in a fixed order. In some embodiments, the two or more components are assembled in any order. In some embodiments, the two or more components are assembled with one or more partitioning components disposed between two components from different layers of the two or more layers. In some embodiments, the individual identifier comprises one component from each layer of a subset of the two or more layers. In some embodiments, the individual identifier comprises at least one component from each of the two or more layers.

In some embodiments, (c) comprises using an endonuclease to generate the at least one sticky end of the individual component of the plurality of components. In some embodiments, the at least one sticky end is at a 5′ end of the individual component. In some embodiments, the at least one sticky end is at a 3′ end of the individual component. In some embodiments, (c) comprises generating two sticky ends of the individual component. In some embodiments, the at least one sticky end is at least one nucleotide in length. In some embodiments, the at least one sticky end is six nucleotides in length. In some embodiments, the at least one sticky end comprises a nucleic acid sequence that is selected from the group consisting of sequences listed in Table 4 or Table 5.

In some embodiments, the plurality of nucleic acid sequences stores metadata of the information or conceals the information. In some embodiments, two or more identifier libraries are combined and wherein each identifier library of the two or more identifier libraries is tagged with a distinct barcode. In some embodiments, each individual identifier in the identifier library comprises a distinct barcode or a subset of identifiers of the identifier library comprises a distinct barcode. In some embodiments, the plurality of identifiers, or the plurality of components that comprise the identifiers, is selected for ease of read, write, access, copy, and deletion operations.

In some embodiments, chemically linking comprises ligating together two or more components of the plurality of components using a reagent comprising a ligase. In some embodiments, the ligase is a T4 ligase, a T7 ligase, a T3 ligase, or an E. coli ligase. In some embodiments, the reagent further comprises an additive. In some embodiments, the additive increases efficiency of the ligase. In some embodiments, the additive comprises polyethylene glycol (PEG). In some embodiments, the PEG is PEG400, PEG6000, PEG8000 or any combination thereof. In some embodiments, a final concentration of the PEG molecules is at least about 1% weight per volume (w/v). In some embodiments, a reaction time of the ligating is at least one minute. In some embodiments, the ligating is at 30 degrees Celsius or higher. In some embodiments, a reaction efficiency of the ligating is at least about 20%. In some embodiments, the method further comprises inactivating the ligase using a buffer containing EDTA or guanidine thiocyanate. In some embodiments, a final concentration of the ligase is at least about 5 CEU/μL. In some embodiments, the reagent further comprises glycerol molecules. In some embodiments, chemically linking in (d) comprises using overlap-extension polymerase chain reaction (PCR).

In some embodiments, the individual component is a deoxyribonucleic acid (DNA) or a ribonucleic acid. In some embodiments, the individual component has been rehydrated. In some embodiments, the individual component is rehydrated from a dehydrated component. In some embodiments, the method further comprises dehydrating the identifier library by dehydrating each individual identifier of at least the subset of the plurality of identifiers. In some embodiments, each individual identifier of at least the subset of the plurality of identifiers is dehydrated. In some embodiments, the method further comprises rehydrating each individual identifier of at least the subset of the plurality of identifiers. In some embodiments, the method further comprises adding a preserving additive to the identifier library to prevent identifier degradation.

In some embodiments, the plurality of identifiers is copied with PCR. In some embodiments, the PCR has at least 10 cycles. In some embodiments, the plurality of identifiers is amplified with PCR up to a concentration of 10 nanograms per microliter. In some embodiments, the PCR is an emulsion PCR. In some embodiments, the plurality of identifiers is copied with linear amplification. In some embodiments, after the PCR, linear amplification is used to create more copies of the plurality of identifiers. In some embodiments, a subset of the plurality of identifiers is accessed with one or more PCR reactions. In some embodiments, a subset of the plurality of identifiers is accessed with one or more affinity tagged probes. In some embodiments, identifiers of the subset of the plurality of identifiers have a set of components in common. In some embodiments, the identifiers are purified by gel electrophoresis. In some embodiments, the identifiers are purified by affinity tagged probes. In some embodiments, the identifiers are amplified using PCR. In some embodiments, the identifiers are designed to avoid thymine-thymine dinucleotides or cytosine-cytosine dinucleotides.
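Two of the design constraints in the embodiments above lend themselves to a simple in silico check: whether two sticky ends can anneal, and whether a candidate sequence avoids thymine-thymine and cytosine-cytosine dinucleotides. The following Python is a minimal sketch only. The function names and the example sequences are hypothetical and are not taken from Table 4 or Table 5, and it assumes both overhangs are of the same type (both 5′ or both 3′) and written 5′ to 3′, in which case they anneal when one is the reverse complement of the other.

```python
# Hypothetical helper names; example sequences are made up for illustration.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def overhangs_compatible(overhang_a: str, overhang_b: str) -> bool:
    """True if two same-type overhangs (both written 5' to 3') can anneal."""
    return overhang_a.upper() == reverse_complement(overhang_b)


def avoids_tt_and_cc(seq: str) -> bool:
    """True if the sequence contains no TT and no CC dinucleotide."""
    s = seq.upper()
    return "TT" not in s and "CC" not in s


# A six-nucleotide overhang and its reverse complement are compatible.
print(overhangs_compatible("ACGTGA", "TCACGT"))  # True
print(avoids_tt_and_cc("ACGTGA"))                # True
print(avoids_tt_and_cc("ACTTGA"))                # False
```

A check of this kind could be run over a candidate component set before synthesis; it says nothing about ligation efficiency, which depends on the reaction conditions described above.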

In another aspect, the present disclosure provides a method for writing information into a nucleic acid sequence, comprising: generating a string of symbols to represent the information; constructing a plurality of components, wherein each individual component of the plurality of components comprises a nucleic acid sequence; generating at least one sticky end of the individual component of the plurality of components, wherein the at least one sticky end is at least six nucleotides in length; chemically linking together two or more components of the plurality of components via the at least one sticky end of the individual component of the two or more components, thereby generating a plurality of identifiers, wherein each identifier of the plurality of identifiers comprises two or more components, wherein an individual identifier of the plurality of identifiers corresponds to an individual symbol in the string of symbols; and selectively capturing or amplifying an identifier library comprising at least a subset of the plurality of identifiers.

In some embodiments, the at least one sticky end is at a 3′ end of the individual component. In some embodiments, the linking comprises linking at least 15 components of the plurality of components. In some embodiments, the at least one sticky end comprises a nucleic acid sequence that is selected from the group consisting of sequences listed in Table 4 or Table 5.

In another aspect, provided herein is a method for writing information into a nucleic acid sequence, comprising: (a) generating a string of symbols to represent the information; (b) constructing a plurality of sticky-end components, wherein each individual component of the plurality of components comprises a nucleic acid sequence and at least one sticky end; (c) chemically linking together two or more components of the plurality of components via the at least one sticky end of the individual component of the two or more components, thereby generating a plurality of identifiers, wherein each identifier of the plurality of identifiers comprises two or more components, wherein an individual identifier of the plurality of identifiers corresponds to an individual symbol in the string of symbols; and (d) selectively capturing or amplifying an identifier library comprising at least a subset of the plurality of identifiers. In some embodiments, (b) comprises annealing two oligonucleotides to construct each individual component such that each individual component has the at least one sticky end.

In an aspect, the present disclosure provides a method for writing information into nucleic acid sequence(s), comprising: (a) translating the information into a string of symbols; (b) mapping the string of symbols to a plurality of identifiers, wherein an individual identifier of the plurality of identifiers comprises one or more components, wherein an individual component of the one or more components comprises a nucleic acid sequence, and wherein the individual identifier of the plurality of identifiers corresponds to an individual symbol of the string of symbols; and (c) constructing an identifier library comprising at least a subset of the plurality of identifiers.

In some embodiments, each symbol in said string of symbols is one of two possible symbol values. In some embodiments, one symbol value at each position of said string of symbols may be represented by the absence of a distinct identifier in the identifier library. In some embodiments, said two possible symbol values are a bit-value of 0 and 1, wherein said individual symbol with said bit-value of 0 in said string of symbols may be represented by an absence of a distinct identifier in said identifier library, wherein said individual symbol with said bit-value of 1 in said string of symbols may be represented by a presence of said distinct identifier in said identifier library, or vice versa. In some embodiments, each symbol of the string of symbols is one of one or more possible symbol values. In some embodiments, a presence of an individual identifier in the identifier library corresponds to a first symbol value in a binary string and an absence of the individual identifier corresponds to a second symbol value in a binary string. In some embodiments, the first symbol value is a bit value of 1 and the second symbol value is a bit value of 0. In some embodiments, the first symbol value is a bit value of 0 and the second symbol value is a bit value of 1.
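The presence/absence mapping described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the disclosed implementation: the identifier library is modeled as the set of bit positions whose identifiers are present, the function name `encode_bits` is hypothetical, and presence is taken to represent a bit-value of 1 (the disclosure equally allows the opposite convention).

```python
from typing import Set


def encode_bits(bits: str) -> Set[int]:
    """Model an identifier library as the set of positions holding a 1.

    Assumption: the identifier for position i is present in the pool when
    bits[i] == "1" and absent when bits[i] == "0".
    """
    if any(b not in "01" for b in bits):
        raise ValueError("bits must contain only '0' and '1'")
    return {i for i, b in enumerate(bits) if b == "1"}


library = encode_bits("01101")
print(sorted(library))  # [1, 2, 4]
```

Only the identifiers for the 1-valued positions need to be constructed, so the amount of material written scales with the number of 1 bits rather than with the length of the string.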

In some embodiments, constructing the individual identifier in the identifier library comprises assembling the one or more components from one or more layers and wherein each layer of the one or more layers comprises a distinct set of components. In some embodiments, the individual identifier from the identifier library comprises one component from each layer of the one or more layers. In some embodiments, the one or more components are assembled in a fixed order. In some embodiments, the one or more components are assembled in a random order. In some embodiments, the one or more components are assembled with one or more partitioning components disposed between two components from different layers of the one or more layers. In some embodiments, the individual identifier comprises one component from each layer of a subset of the one or more layers. In some embodiments, the individual identifier comprises at least one component from each of the one or more layers. In some embodiments, the one or more components are assembled using overlap-extension polymerase chain reaction (PCR), polymerase cycling assembly, sticky end ligation, BioBricks assembly, Golden Gate assembly, Gibson assembly, recombinase assembly, ligase cycling reaction, or template directed ligation.
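For the embodiment in which each identifier comprises one component from each layer, assembled in a fixed order, the number of distinct identifiers is the product of the layer sizes. The Python sketch below shows one way a symbol position could be mapped to such a combination, using a mixed-radix index. The component labels, the function name `identifier_for_position`, and the particular ordering are illustrative assumptions and are not specified by the disclosure.

```python
from itertools import product
from typing import List, Sequence, Tuple


def identifier_for_position(layers: Sequence[Sequence[str]],
                            position: int) -> Tuple[str, ...]:
    """Pick one component per layer for a given symbol position.

    Assumption: positions are numbered in the same order that
    itertools.product enumerates the layers (last layer varies fastest).
    """
    total = 1
    for layer in layers:
        total *= len(layer)
    if not 0 <= position < total:
        raise IndexError("position outside the combinatorial space")
    chosen: List[str] = []
    for layer in reversed(layers):
        position, index = divmod(position, len(layer))
        chosen.append(layer[index])
    return tuple(reversed(chosen))


# Hypothetical component labels: two layers of sizes 2 and 3 give 6 identifiers.
layers = [["A0", "A1"], ["B0", "B1", "B2"]]
print(len(list(product(*layers))))         # 6
print(identifier_for_position(layers, 4))  # ('A1', 'B1')
```

With this layout, five components are enough to address six positions; the saving grows quickly as layers are added, which is the point of the combinatorial strategy.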

In some embodiments, constructing the individual identifier in the identifier library comprises deleting, replacing, or inserting at least one component in a parent identifier by applying nucleic acid editing enzymes to the parent identifier. In some embodiments, the parent identifier comprises a plurality of components flanked by nuclease-specific target sites, recombinase recognition sites, or distinct spacer sequences. In some embodiments, the nucleic acid editing enzymes are selected from the group consisting of CRISPR-Cas, TALENs, Zinc Finger Nucleases, Recombinases, and functional variants thereof.

In some embodiments, the identifier library comprises a plurality of nucleic acid sequences. In some embodiments, the plurality of nucleic acid sequences stores metadata of the information and/or conceals the information. In some embodiments, the metadata comprises secondary information corresponding to a source of the information, an intended recipient of the information, an original format of the information, instrumentation and methods used to encode the information, a date and a time of writing the information into the identifier library, modifications made to the information, and/or a reference to other information.

In some embodiments, one or more identifier libraries are combined and wherein each identifier library of the one or more identifier libraries is tagged with a distinct barcode. In some embodiments, each individual identifier in the identifier library comprises the distinct barcode. In some embodiments, the plurality of identifiers is selected for ease of read, write, access, copy, and deletion operations. In some embodiments, the plurality of identifiers is selected to minimize write errors, mutations, degradation, and read errors.

In another aspect, the present disclosure provides a method for copying information encoded in nucleic acid sequence(s), comprising: (a) providing an identifier library encoding a string of symbols, wherein the identifier library comprises a plurality of identifiers, wherein an individual identifier of the plurality of identifiers comprises one or more components, wherein an individual component of the one or more components comprises a nucleic acid sequence, and wherein the individual identifier of the plurality of identifiers corresponds to an individual symbol of the string of symbols; and (b) constructing one or more copies of the identifier library.

In some embodiments, the plurality of identifiers comprises one or more primer binding sites. In some embodiments, the identifier library is copied using nucleic acid amplification such as polymerase chain reaction (PCR) (see Chemical Methods Section D). In some embodiments, the PCR is conventional PCR or linear PCR and wherein a number of copies of the identifier library doubles or increases linearly, respectively, with each PCR cycle. In some embodiments, the individual identifier in the identifier library is ligated into a circular vector prior to PCR and wherein the circular vector comprises correlated barcodes at each end of the individual identifier, such that if any unintended DNA cross-over events occur during the PCR, the resulting malformed molecules will be detectable in sequencing. In some embodiments, the PCR is isothermal. In some embodiments, the PCR is a form of rolling circle amplification. In some embodiments, the PCR is emulsion PCR (ePCR).
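The difference between doubling and linear growth can be made concrete with a short calculation. The Python below is an idealized sketch that assumes every template is copied exactly once per cycle (100% efficiency), which real reactions do not achieve; the function names are hypothetical. For linear PCR, it assumes only the original templates are copied in each cycle, so the total grows by the initial template count per cycle.

```python
def copies_after_conventional_pcr(initial: int, cycles: int) -> int:
    """Idealized conventional PCR: the copy number doubles each cycle."""
    return initial * 2 ** cycles


def copies_after_linear_pcr(initial: int, cycles: int) -> int:
    """Idealized linear PCR: each cycle adds one copy per original template."""
    return initial * (1 + cycles)


# 1,000 starting molecules after 10 cycles under each idealized model.
print(copies_after_conventional_pcr(1000, 10))  # 1024000
print(copies_after_linear_pcr(1000, 10))        # 11000
```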

In some embodiments, the identifier library comprises a plurality of nucleic acid sequences. In some embodiments, the plurality of nucleic acid sequences is copied. In some embodiments, one or more identifier libraries are combined prior to copying and wherein each library of the one or more identifier libraries comprises a distinct barcode.

In another aspect, the present disclosure provides a method for accessing information encoded in nucleic acid sequence(s), comprising: (a) providing an identifier library encoding a string of symbols, wherein the identifier library comprises a plurality of identifiers, wherein an individual identifier of the plurality of identifiers comprises one or more components, wherein an individual component of the one or more components comprises a nucleic acid sequence, and wherein the individual identifier of the plurality of identifiers corresponds to an individual symbol of the string of symbols; and (b) extracting a targeted subset of the plurality of identifiers from the identifier library.

In some embodiments, a plurality of probes is combined with the identifier library. In some embodiments, the plurality of probes shares complementarity with the targeted subset of the plurality of identifiers from the identifier library. In some embodiments, the plurality of probes hybridizes to the targeted subset of the plurality of identifiers in the identifier library. In some embodiments, the plurality of probes comprises one or more affinity tags and wherein the one or more affinity tags is captured by an affinity bead or an affinity column, in a process that may be referred to as nucleic acid capture (see Chemical Methods Section F on nucleic acid capture).

In some embodiments, the identifier library is sequentially combined with one or more subsets of the plurality of probes and wherein a portion of the identifier library binds to the one or more subsets of the plurality of probes. In some embodiments, the portion of the identifier library that binds to the one or more subsets of the plurality of probes is removed prior to the addition of another subset of the plurality of probes to the identifier library. In these embodiments of nucleic acid capture, the captured nucleic acids may be removed from the identifier pool instead of preserved.

In some embodiments, the individual identifier of the plurality of identifiers comprises one or more common primer binding regions, one or more variable primer binding regions, or any combination thereof. In some embodiments, the identifier library is combined with primers that bind to the one or more common primer binding regions or to the one or more variable primer binding regions. In some embodiments, the primers that bind to the one or more variable primer binding regions are used to selectively amplify the targeted subset of the identifier library (see Chemical Methods Section D).
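Selective access to a targeted subset can be modeled abstractly as filtering identifiers by a component they share. The Python below is only an in silico analogy for probe capture or selective amplification, not a description of the chemistry. It assumes each identifier is represented as a tuple with one component label per layer; the labels and the function name `access_subset` are hypothetical.

```python
from typing import Set, Tuple

Identifier = Tuple[str, ...]


def access_subset(library: Set[Identifier], layer_index: int,
                  component: str) -> Set[Identifier]:
    """Return the identifiers that carry `component` at `layer_index`.

    Assumption: a primer or probe specific to that component would target
    exactly these identifiers and no others.
    """
    return {ident for ident in library if ident[layer_index] == component}


# Hypothetical library of three identifiers built from two layers.
library = {("A0", "B1"), ("A1", "B0"), ("A1", "B2")}
print(sorted(access_subset(library, 0, "A1")))
# [('A1', 'B0'), ('A1', 'B2')]
```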

In some embodiments, a portion of identifiers is removed from the identifier library by selective nuclease cleavage. In some embodiments, the identifier library is combined with Cas9 and guide probes and wherein the guide probes guide the Cas9 to remove specified identifiers from the identifier library. In some embodiments, the individual identifiers are single-stranded and wherein the identifier library is combined with single-strand specific endonuclease(s). In some embodiments, the identifier library is mixed with a complementary set of individual identifiers that protect target individual identifiers from degradation prior to the addition of the single-strand specific endonuclease(s). In some embodiments, the individual identifiers that are not cleaved by the selective nuclease cleavage are separated by size-selective chromatography (see Chemical Methods Section E on nucleic acid size selection). In some embodiments, the individual identifiers that are not cleaved by the selective nuclease cleavage are amplified and wherein the individual identifiers that are cleaved by the selective nuclease cleavage are not amplified (see Chemical Methods Section D on nucleic acid amplification). In some embodiments, the individual identifiers that are not cleaved by the selective nuclease cleavage are captured and wherein the individual identifiers that are cleaved by the selective nuclease cleavage are not captured (see Chemical Methods Section F on nucleic acid capture). In some embodiments, the identifier library comprises a plurality of nucleic acid sequences and wherein the plurality of nucleic acid sequences is extracted with the targeted subset of the plurality of identifiers in the identifier library.

In another aspect, the present disclosure provides a method for reading information encoded in nucleic acid sequence(s), comprising: (a) providing an identifier library comprising a plurality of identifiers, wherein an individual identifier of the plurality of identifiers comprises one or more components, wherein an individual component of the one or more components comprises a nucleic acid sequence; (b) identifying the plurality of identifiers in the identifier library; (c) generating a plurality of symbols from the plurality of identifiers identified in (b), wherein an individual symbol of the plurality of symbols corresponds to the individual identifier of the plurality of identifiers; and (d) compiling the information from the plurality of symbols.

In some embodiments, each symbol in said string of symbols is one of two possible symbol values. In some embodiments, one symbol value at each position of said string of symbols may be represented by the absence of a distinct identifier in the identifier library. In some embodiments, said two possible symbol values are a bit-value of 0 and 1, wherein said individual symbol with said bit-value of 0 in said string of symbols may be represented by an absence of a distinct identifier in said identifier library, wherein said individual symbol with said bit-value of 1 in said string of symbols may be represented by a presence of said distinct identifier in said identifier library, or vice versa. In some embodiments, a presence of an individual identifier in the identifier library corresponds to a first symbol value in a binary string and an absence of the individual identifier in the identifier library corresponds to a second symbol value in a binary string. In some embodiments, the first symbol value is a bit value of 1 and the second symbol value is a bit value of 0. In some embodiments, the first symbol value is a bit value of 0 and the second symbol value is a bit value of 1.
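Reading reverses the presence/absence mapping: once the identifiers in the library have been identified, each position is assigned a symbol according to whether its identifier was found. The Python sketch below assumes the identified identifiers have already been reduced to a set of position indices, that presence represents a bit-value of 1, and that the length of the string is known from elsewhere; the function name `decode_bits` is hypothetical.

```python
from typing import Set


def decode_bits(present_positions: Set[int], length: int) -> str:
    """Rebuild a bit string from the positions whose identifiers were found.

    Assumption: `length` is supplied separately, because trailing absent
    identifiers cannot be distinguished from positions that do not exist.
    """
    if any(p < 0 or p >= length for p in present_positions):
        raise ValueError("identified position falls outside the string")
    return "".join("1" if i in present_positions else "0"
                   for i in range(length))


print(decode_bits({1, 2, 4}, 5))  # 01101
```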

In some embodiments, identifying the plurality of identifiers comprises sequencing the plurality of identifiers in the identifier library. In some embodiments, sequencing comprises digital polymerase chain reaction (PCR), quantitative PCR, a microarray, sequencing by synthesis, or massively-parallel sequencing. In some embodiments, the identifier library comprises a plurality of nucleic acid sequences. In some embodiments, the plurality of nucleic acid sequences stores metadata of the information and/or conceals the information. In some embodiments, one or more identifier libraries are combined and wherein each identifier library in the one or more identifier libraries comprises a distinct barcode. In some embodiments, the barcode stores metadata of the information.

In another aspect, the present disclosure provides a method for nucleic acid-based computer data storage, comprising: (a) receiving computer data, (b) synthesizing nucleic acid molecules comprising nucleic acid sequences encoding the computer data, wherein the computer data is encoded in at least a subset of nucleic acid molecules synthesized and not in a sequence of each of the nucleic acid molecules, and (c) storing the nucleic acid molecules having the nucleic acid sequences.

In some embodiments, the at least the subset of the nucleic acid molecules are grouped together. In some embodiments, the method further comprises sequencing the nucleic acid molecule(s) to determine the nucleic acid sequence(s), thereby retrieving the computer data. In some embodiments, (b) is performed in a time period that is less than about 1 day. In some embodiments, (b) is performed at an accuracy of at least about 90%.

In another aspect, the present disclosure provides a method for nucleic acid-based computer data storage, comprising: (a) receiving computer data, (b) synthesizing a nucleic acid molecule comprising at least one nucleic acid sequence encoding the computer data, wherein synthesizing the nucleic acid molecule is performed in the absence of base-by-base nucleic acid synthesis, and (c) storing the nucleic acid molecule comprising the at least one nucleic acid sequence.

In some embodiments, the method further comprises sequencing the nucleic acid molecule to determine the nucleic acid sequence, thereby retrieving the computer data. In some embodiments, (b) is performed in a time period that is less than about 1 day. In some embodiments, (b) is performed at an accuracy of at least about 90%.

In another aspect, the present disclosure provides a system for encoding binary sequence data using nucleic acids, comprising: a device configured to construct an identifier library, wherein the identifier library comprises a plurality of identifiers, wherein an individual identifier of the plurality of identifiers comprises one or more components, and wherein an individual component of the one or more components is a nucleic acid sequence; and one or more computer processors operatively coupled to the device, wherein the one or more computer processors are individually or collectively programmed to (i) translate the information into a string of symbols, (ii) map the string of symbols to the plurality of identifiers, wherein the individual identifier of the plurality of identifiers corresponds to an individual symbol of the string of symbols, and (iii) construct an identifier library comprising the plurality of identifiers.

In some embodiments, the device comprises a plurality of partitions and wherein the identifier library is generated in one or more of the plurality of partitions. In some embodiments, the plurality of partitions comprises wells. In some embodiments, constructing the individual identifier in the identifier library comprises assembling the one or more components from one or more layers and wherein each layer of the one or more layers comprises a distinct set of components. In some embodiments, each layer of the one or more layers is stored in a separate portion of the device and wherein the device is configured to combine the one or more components from the one or more layers. In some embodiments, the identifier library comprises a plurality of nucleic acid sequences. In some embodiments, one or more identifier libraries are combined in a single area of the device and wherein each identifier library of the one or more identifier libraries comprises a distinct barcode.

In another aspect, the present disclosure provides a system for reading information encoded in nucleic acid sequence(s), comprising: a database that stores an identifier library comprising a plurality of identifiers, wherein an individual identifier of the plurality of identifiers comprises one or more components, wherein an individual component of the one or more components comprises a nucleic acid sequence; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to (i) identify the plurality of identifiers in the identifier library, (ii) generate a plurality of symbols from the plurality of identifiers identified in (i), wherein an individual symbol of the plurality of symbols corresponds to the individual identifier of the plurality of identifiers, and (iii) compile the information from the plurality of symbols.

In some embodiments, the system further comprises a plurality of partitions. In some embodiments, the partitions are wells. In some embodiments, a given partition of the plurality of partitions comprises one or more identifier libraries and wherein each identifier library of the one or more identifier libraries comprises a distinct barcode. In some embodiments, the system further comprises a detection unit configured to identify the plurality of identifiers in the identifier library.

Additional aspects and advantages of the present disclosure will becomereadily apparent to those skilled in this art from the followingdetailed description, wherein only illustrative embodiments of thepresent disclosure are shown and described. As will be realized, thepresent disclosure is capable of other and different embodiments, andits several details are capable of modifications in various obviousrespects, all without departing from the disclosure. Accordingly, thedrawings and description are to be regarded as illustrative in nature,and not as restrictive.

INCORPORATION BY REFERENCE

All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference. To the extent publications and patents or patent applications incorporated by reference contradict the disclosure contained in the specification, the specification is intended to supersede and/or take precedence over any such contradictory material.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings (also “Figure” and “FIG.” herein), of which:

FIG. 1 schematically illustrates an overview of a process for encoding, writing, accessing, reading, and decoding digital information stored in nucleic acid sequences;

FIG. 2A and FIG. 2B schematically illustrate an example method of encoding digital data, referred to as “data at address”, using objects or identifiers (e.g., nucleic acid molecules);

FIG. 2A illustrates combining a rank object (or address object) with a byte-value object (or data object) to create an identifier; FIG. 2B illustrates an embodiment of the data at address method wherein the rank objects and byte-value objects are themselves combinatorial concatenations of other objects;

FIG. 3A and FIG. 3B schematically illustrate an example method of encoding digital information using objects or identifiers (e.g., nucleic acid sequences); FIG. 3A illustrates encoding digital information using a rank object as an identifier; FIG. 3B illustrates an embodiment of the encoding method wherein the address objects are themselves combinatorial concatenations of other objects;

FIG. 4 shows a contour plot, in log space, of a relationship between the combinatorial space of possible identifiers (C, x-axis) and the average number of identifiers (k, y-axis) that may be constructed to store information of a given size (contour lines);

FIG. 5 schematically illustrates an overview of a method for writing information to nucleic acid sequences (e.g., deoxyribonucleic acid);

FIG. 6A and FIG. 6B illustrate an example method, referred to as the “product scheme”, for constructing identifiers (e.g., nucleic acid molecules) by combinatorially assembling distinct components (e.g., nucleic acid sequences); FIG. 6A illustrates the architecture of identifiers constructed using the product scheme; FIG. 6B illustrates an example of the combinatorial space of identifiers that may be constructed using the product scheme;

FIG. 7 schematically illustrates the use of overlap extension polymerase chain reaction to construct identifiers (e.g., nucleic acid molecules) from components (e.g., nucleic acid sequences);

FIG. 8 schematically illustrates the use of sticky end ligation to construct identifiers (e.g., nucleic acid molecules) from components (e.g., nucleic acid sequences);

FIG. 9 schematically illustrates the use of recombinase assembly to construct identifiers (e.g., nucleic acid molecules) from components (e.g., nucleic acid sequences);

FIG. 10A and FIG. 10B demonstrate template directed ligation; FIG. 10A schematically illustrates the use of template directed ligation to construct identifiers (e.g., nucleic acid molecules) from components (e.g., nucleic acid sequences); FIG. 10B shows a histogram of the copy numbers (abundances) of 256 distinct nucleic acid sequences that were each combinatorially assembled from six nucleic acid sequences (e.g., components) in one pooled template directed ligation reaction;

FIG. 11A, FIG. 11B, FIG. 11C, FIG. 11D, FIG. 11E, FIG. 11F, and FIG. 11G schematically illustrate an example method, referred to as the “permutation scheme”, for constructing identifiers (e.g., nucleic acid molecules) with permuted components (e.g., nucleic acid sequences); FIG. 11A illustrates the architecture of identifiers constructed using the permutation scheme; FIG. 11B illustrates an example of the combinatorial space of identifiers that may be constructed using the permutation scheme; FIG. 11C shows an example implementation of the permutation scheme with template directed ligation; FIG. 11D shows an example of how the implementation from FIG. 11C may be modified to construct identifiers with permuted and repeated components; FIG. 11E shows how the example implementation from FIG. 11D may lead to unwanted byproducts that may be removed with nucleic acid size selection; FIG. 11F shows another example of how to use template directed ligation and size selection to construct identifiers with permuted and repeated components; FIG. 11G shows an example of when size selection may fail to isolate a particular identifier from unwanted byproducts;

FIG. 12A, FIG. 12B, FIG. 12C, and FIG. 12D schematically illustrate an example method, referred to as the “MchooseK” scheme, for constructing identifiers (e.g., nucleic acid molecules) with any number, K, of assembled components (e.g., nucleic acid sequences) out of a larger number, M, of possible components; FIG. 12A illustrates the architecture of identifiers constructed using the MchooseK scheme; FIG. 12B illustrates an example of the combinatorial space of identifiers that may be constructed using the MchooseK scheme; FIG. 12C shows an example implementation of the MchooseK scheme using template directed ligation; FIG. 12D shows how the example implementation from FIG. 12C may lead to unwanted byproducts that may be removed with nucleic acid size selection;

FIG. 13A and FIG. 13B schematically illustrate an example method, referred to as the “partition scheme”, for constructing identifiers with partitioned components; FIG. 13A shows an example of the combinatorial space of identifiers that may be constructed using the partition scheme; FIG. 13B shows an example implementation of the partition scheme using template directed ligation;

FIG. 14A and FIG. 14B schematically illustrate an example method, referred to as the “unconstrained string” (or USS) scheme, for constructing identifiers made up of any string of components from a number of possible components; FIG. 14A shows an example of the combinatorial space of identifiers that may be constructed using the USS scheme; FIG. 14B shows an example implementation of the USS scheme using template directed ligation;

FIG. 15A and FIG. 15B schematically illustrate an example method, referred to as “component deletion”, for constructing identifiers by removing components from a parent identifier; FIG. 15A shows an example of the combinatorial space of identifiers that may be constructed using the component deletion scheme; FIG. 15B shows an example implementation of the component deletion scheme using double stranded targeted cleavage and repair;

FIG. 16 schematically illustrates a parent identifier with recombinase recognition sites where further identifiers may be constructed by applying recombinases to the parent identifier;

FIG. 17A, FIG. 17B, and FIG. 17C schematically illustrate an overview of example methods for accessing portions of information stored in nucleic acid sequences by accessing a number of particular identifiers from a larger number of identifiers; FIG. 17A shows example methods for using polymerase chain reaction, affinity tagged probes, and degradation targeting probes to access identifiers containing a specified component; FIG. 17B shows example methods for using polymerase chain reaction to perform ‘OR’ or ‘AND’ operations to access identifiers containing multiple specified components; FIG. 17C shows example methods for using affinity tags to perform ‘OR’ or ‘AND’ operations to access identifiers containing multiple specified components;

FIG. 18A and FIG. 18B show examples of encoding, writing, and reading data encoded in nucleic acid molecules; FIG. 18A shows an example of encoding, writing, and reading 5,856 bits of data; FIG. 18B shows an example of encoding, writing, and reading 62,824 bits of data; and

FIG. 19 shows a computer system that is programmed or otherwise configured to implement methods provided herein.

FIG. 20 shows an example scheme of assembly of any two selected double-stranded components from a single parent set of double-stranded components.

FIG. 21 shows possible sticky-end component structures made from two oligos, X and Y.

FIG. 22 shows an exemplary gel electrophoresis image of qPCR products from 15-piece, sticky-ended DNA component ligations.

FIG. 23A shows exemplary data for ligation efficiency of 15-piece, 6-base 5′ overhang DNA component sets ligated for 2, 2.5, 3, and 1440 minutes.

FIG. 23B shows exemplary data for ligation efficiency of 15-piece, 6-base 3′ overhang DNA component sets ligated for 2, 2.5, 3, and 1440 minutes.

FIG. 23C shows an exemplary gel electrophoresis image of the qPCR products.

FIG. 24A shows exemplary data presenting the ligation efficiency for DNA component pairs grouped by overhang lengths.

FIG. 24B shows exemplary data presenting the ligation efficiency for DNA component pairs grouped by overhang lengths.

FIG. 25A shows exemplary data presenting the ligation efficiency for DNA component pairs grouped by GC content.

FIG. 25B shows exemplary data presenting the ligation efficiency for DNA component pairs grouped by GC content.

FIG. 26 shows exemplary data from the ligation of 4 sticky-ended (with 6-base, 3′ overhangs) DNA components, ligated together with T4 ligase at various temperatures.

FIG. 27 shows exemplary data from the ligation of 4 sticky-ended (with 6-base, 3′ overhangs) DNA components, ligated together with T4 ligase at various temperatures.

FIG. 28A shows exemplary data for ligation efficiencies of T7 DNA ligase, as compared to T4 DNA ligase.

FIG. 28B shows exemplary data for ligation efficiencies of T3 DNA ligase, as compared to T4 DNA ligase.

FIG. 29 shows exemplary data for ligation efficiencies of E. coli DNA ligase at various concentrations.

FIG. 30A shows exemplary data from the ligation of 4 sticky-ended (with 6-base, 3′ overhangs) DNA components, ligated together with T7 DNA ligase at various temperatures.

FIG. 30B shows exemplary data from the ligation of 4 sticky-ended (with 6-base, 3′ overhangs) DNA components, ligated together with T3 DNA ligase at various temperatures.

FIG. 31A shows exemplary data of effects of PEG8000 on ligation efficiency.

FIG. 31B shows exemplary data of effects of PEG6000 on ligation efficiency.

FIG. 31C shows exemplary data of effects of PEG400 on ligation efficiency.

FIG. 32 shows exemplary data from ligation of four sticky-ended (with 10-base, 3′ overhangs) DNA components ligated together in the presence of PEG400 or PEG6000.

FIG. 33 shows exemplary qPCR data of effects of buffer QG or EDTA on ligase.

FIG. 34 shows exemplary data on the linearity of replication using Q5, Phusion, and Taq DNA polymerase.

FIG. 35 shows an exemplary gel image of different DNA samples stored at room temperature for 4 days.

FIG. 36 shows exemplary data for DNA repeatedly being dried and re-hydrated at room temperature.

FIG. 37 shows an exemplary scheme of constructed sticky end sequences.

FIG. 38A shows exemplary data from the ligation of different pairs of overhang sequences listed in Table 4.

FIG. 38B shows exemplary data from the ligation of different pairs of overhang sequences listed in Table 5.

FIG. 39 shows penalty scores from 2 million subsets of 15 overhangs from each set of overhangs listed in Table 4 and Table 5.

FIG. 40 shows exemplary data for ligation efficiency of 16 DNA components using the overhangs from the final row of Table 7.

FIG. 41A shows a 341×351 reference map of an encoded message (after computational encoding).

FIG. 41B shows a heat map (341×351) of the abundances of sequences present in the identifier library as determined by sequencing.

FIG. 42 shows exemplary data from a duplicate run of the entire encoding, writing, sequencing, and decoding process as shown in FIGS. 41A-B.

FIG. 43A shows a heat map (341×351) of the abundances of sequences present in the replicated identifier library as determined by sequencing. The data were obtained from creating multiple copies of the original identifier library containing the message from FIGS. 41A-B.

FIG. 43B shows the correlation between identifier copy numbers in the original identifier library versus the replicated identifier library.

FIG. 43C shows the distribution of identifier copy numbers in the original identifier library versus the replicated identifier library.

FIG. 44A shows a heat map (341×351) of the abundances of sequences present in the accessed identifier library as determined by sequencing. The data were obtained from accessing a portion of the identifier library containing the original message from FIGS. 41A-B.

FIG. 44B shows the correlation between identifier copy numbers in the original library versus the accessed identifier library.

FIG. 44C shows the distribution of identifier copy numbers in the original identifier library versus the accessed identifier library.

FIG. 45A shows a heat map (341×351) of the abundances of sequences present in the 2× accessed identifier library as determined by sequencing. The data were obtained from further accessing a sub-portion of the accessed identifier library from FIGS. 44A-C.

FIG. 45B shows the correlation between identifier copy numbers in the original library versus the 2× accessed identifier library.

FIG. 45C shows the distribution of identifier copy numbers in the original identifier library versus the 2× accessed identifier library.

FIG. 46A shows a heat map (341×351) of the abundances of sequences present in the stored identifier library as determined by sequencing. The data were obtained after storing the original identifier library representing the message from FIGS. 41A-B at 100° C. for 4 days.

FIG. 46B shows the correlation between identifier copy numbers in the original identifier library versus the replicated identifier library.

FIG. 46C shows the distribution of identifier copy numbers in the original identifier library versus the replicated identifier library.

FIG. 47A shows exemplary data for DNA samples incubated for 8 days at 75.1° C.

FIG. 47B shows exemplary data for DNA samples incubated for 8 days at 84.4° C.

FIG. 47C shows exemplary data for DNA samples incubated for 8 days at 90.2° C.

FIG. 47D shows exemplary data for DNA samples incubated for 8 days at 95.0° C.

FIG. 48 shows exemplary data from ligation of four sticky-ended (with 6-base, 3′ overhangs) DNA components ligated together with various amounts (in terms of percent volume-per-volume) of glycerol.

DETAILED DESCRIPTION

While various embodiments of the invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions may occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed.

The term “symbol,” as used herein, generally refers to a representation of a unit of digital information. Digital information may be divided or translated into a string of symbols. In an example, a symbol may be a bit and the bit may have a value of ‘0’ or ‘1’.

The term “distinct,” or “unique,” as used herein, generally refers to an object that is distinguishable from other objects in a group. For example, a distinct, or unique, nucleic acid sequence may be a nucleic acid sequence that does not have the same sequence as any other nucleic acid sequence. A distinct, or unique, nucleic acid molecule may not have the same sequence as any other nucleic acid molecule. The distinct, or unique, nucleic acid sequence or molecule may share regions of similarity with another nucleic acid sequence or molecule.

The term “component,” as used herein, generally refers to a nucleic acid sequence. A component may be a distinct nucleic acid sequence. A component may be concatenated or assembled with one or more other components to generate other nucleic acid sequences or molecules.

The term “layer,” as used herein, generally refers to a group or pool of components. Each layer may comprise a set of distinct components such that the components in one layer are different from the components in another layer. Components from one or more layers may be assembled to generate one or more identifiers.

The term “identifier,” as used herein, generally refers to a nucleic acid molecule or a nucleic acid sequence that represents the position and value of a bit-string within a larger bit-string. More generally, an identifier may refer to any object that represents or corresponds to a symbol in a string of symbols. In some embodiments, identifiers may comprise one or multiple concatenated components.

The term “combinatorial space,” as used herein, generally refers to the set of all possible distinct identifiers that may be generated from a starting set of objects, such as components, and a permissible set of rules for how to modify those objects to form identifiers. The size of a combinatorial space of identifiers made by assembling or concatenating components may depend on the number of layers of components, the number of components in each layer, and the particular assembly method used to generate the identifiers.
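As a minimal, non-limiting sketch of how the size of a combinatorial space may depend on the layers and their components, the following Python example assumes a product-style assembly rule in which each identifier takes exactly one component from each layer, in a fixed layer order. The layer contents and the component names (e.g., "A1") are hypothetical placeholders and are not sequences from this disclosure; other assembly methods would give different counts.

```python
from itertools import product
from math import prod

# Hypothetical layers; each string stands in for a distinct component sequence.
layers = [
    ["A1", "A2", "A3"],
    ["B1", "B2"],
    ["C1", "C2", "C3", "C4"],
]


def combinatorial_space(layers):
    """Enumerate identifiers that take one component from each layer, in layer order."""
    return [tuple(choice) for choice in product(*layers)]


identifiers = combinatorial_space(layers)

# Under this assumed rule, the space size is the product of the layer sizes.
space_size = prod(len(layer) for layer in layers)
```

In this sketch, nine components in total (3 + 2 + 4) yield a space whose size is the product of the layer sizes, which is why the number of layers and the number of components per layer both matter.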

The term “identifier rank,” as used herein, generally refers to a relation that defines the order of identifiers in a set.

The term “identifier library,” as used herein, generally refers to a collection of identifiers corresponding to the symbols in a symbol string representing digital information. In some embodiments, the absence of a given identifier in the identifier library may indicate a symbol value at a particular position. One or more identifier libraries may be combined in a pool, group, or set of identifiers. Each identifier library may include a unique barcode that identifies the identifier library.

The term “nucleic acid,” as used herein, generally refers to deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or a variant thereof. A nucleic acid may include one or more subunits selected from adenosine (A), cytosine (C), guanine (G), thymine (T), and uracil (U), or variants thereof. A nucleotide can include A, C, G, T, or U, or variants thereof. A nucleotide can include any subunit that can be incorporated into a growing nucleic acid strand. Such subunit can be A, C, G, T, or U, or any other subunit that may be specific to one or more complementary A, C, G, T, or U, or complementary to a purine (i.e., A or G, or variant thereof) or pyrimidine (i.e., C, T, or U, or variant thereof). In some examples, a nucleic acid may be single-stranded or double-stranded. In some cases, a nucleic acid is circular.

The terms “nucleic acid molecule” or “nucleic acid sequence,” as used herein, generally refer to a polymeric form of nucleotides, or polynucleotide, that may have various lengths, either deoxyribonucleotides (DNA) or ribonucleotides (RNA), or analogs thereof. The term “nucleic acid sequence” may refer to the alphabetical representation of a polynucleotide; alternatively, the term may be applied to the physical polynucleotide itself. This alphabetical representation can be input into databases in a computer having a central processing unit and used for mapping nucleic acid sequences or nucleic acid molecules to symbols, or bits, encoding digital information. Nucleic acid sequences or oligonucleotides may include one or more non-standard nucleotide(s), nucleotide analog(s) and/or modified nucleotides.

An “oligonucleotide”, as used herein, generally refers to a single-stranded nucleic acid sequence, and is typically composed of a specific sequence of four nucleotide bases: adenine (A); cytosine (C); guanine (G); and thymine (T), or uracil (U) when the polynucleotide is RNA.

Examples of modified nucleotides include, but are not limited to, diaminopurine, 5-fluorouracil, 5-bromouracil, 5-chlorouracil, 5-iodouracil, hypoxanthine, xanthine, 4-acetylcytosine, 5-(carboxyhydroxylmethyl)uracil, 5-carboxymethylaminomethyl-2-thiouridine, dihydrouracil, beta-D-galactosylqueosine, inosine, N6-isopentenyladenine, 1-methylguanine, 1-methylinosine, 2,2-dimethylguanine, 2-methyladenine, 2-methylguanine, 3-methylcytosine, 5-methylcytosine, N6-adenine, 7-methylguanine, 5-methylaminomethyluracil, 5-methoxyaminomethyl-2-thiouracil, beta-D-mannosylqueosine, 5′-methoxycarboxymethyluracil, 5-methoxyuracil, 2-methylthio-N6-isopentenyladenine, uracil-5-oxyacetic acid (v), wybutoxosine, pseudouracil, queosine, 2-thiocytosine, 5-methyl-2-thiouracil, 2-thiouracil, 4-thiouracil, 5-methyluracil, uracil-5-oxyacetic acid methylester, uracil-5-oxyacetic acid (v), 5-methyl-2-thiouracil, 3-(3-amino-3-N-2-carboxypropyl)uracil, (acp3)w, 2,6-diaminopurine and the like. Nucleic acid molecules may also be modified at the base moiety (e.g., at one or more atoms that typically are available to form a hydrogen bond with a complementary nucleotide and/or at one or more atoms that are not typically capable of forming a hydrogen bond with a complementary nucleotide), sugar moiety or phosphate backbone. Nucleic acid molecules may also contain amine-modified groups, such as aminoallyl-dUTP (aa-dUTP) and aminohexylacrylamide-dCTP (aha-dCTP) to allow covalent attachment of amine reactive moieties, such as N-hydroxysuccinimide esters (NHS).

The term “primer,” as used herein, generally refers to a strand of nucleic acid that serves as a starting point for nucleic acid synthesis, such as polymerase chain reaction (PCR). In an example, during replication of a DNA sample, an enzyme that catalyzes replication starts replication at the 3′-end of a primer attached to the DNA sample and copies the opposite strand. See Chemical Methods Section D for more information on PCR, including details about primer design.

The term “polymerase” or “polymerase enzyme,” as used herein, generally refers to any enzyme capable of catalyzing a polymerase reaction. Examples of polymerases include, without limitation, a nucleic acid polymerase. The polymerase can be naturally occurring or synthesized. An example polymerase is a Φ29 polymerase or derivative thereof. In some cases, a transcriptase or a ligase is used (i.e., enzymes which catalyze the formation of a bond) in conjunction with polymerases or as an alternative to polymerases to construct new nucleic acid sequences. Examples of polymerases include a DNA polymerase, an RNA polymerase, a thermostable polymerase, a wild-type polymerase, a modified polymerase, E. coli DNA polymerase I, T7 DNA polymerase, bacteriophage T4 DNA polymerase, Φ29 (phi29) DNA polymerase, Taq polymerase, Tth polymerase, Tli polymerase, Pfu polymerase, Pwo polymerase, VENT polymerase, DEEPVENT polymerase, Ex-Taq polymerase, LA-Taq polymerase, Sso polymerase, Poc polymerase, Pab polymerase, Mth polymerase, ES4 polymerase, Tru polymerase, Tac polymerase, Tne polymerase, Tma polymerase, Tca polymerase, Tih polymerase, Tfi polymerase, Platinum Taq polymerases, Tbr polymerase, Tfl polymerase, Pfuturbo polymerase, Pyrobest polymerase, KOD polymerase, Bst polymerase, Sac polymerase, Klenow fragment polymerase with 3′ to 5′ exonuclease activity, and variants, modified products and derivatives thereof. See Chemical Methods Section D for additional polymerases that may be used with PCR as well as for details on how polymerase characteristics may affect PCR.

Digital information, such as computer data, in the form of binary code can comprise a sequence or string of symbols. A binary code may encode or represent text or computer processor instructions using, for example, a binary number system having two binary symbols, typically 0 and 1, referred to as bits. Digital information may be represented in the form of non-binary code which can comprise a sequence of non-binary symbols. Each encoded symbol can be re-assigned to a unique bit string (or “byte”), and the unique bit string or byte can be arranged into strings of bytes or byte streams. A bit value for a given bit can be one of two symbols (e.g., 0 or 1). A byte, which can comprise a string of N bits, can have a total of 2^N unique byte-values. For example, a byte comprising 8 bits can produce a total of 2⁸ or 256 possible unique byte-values, and each of the 256 bytes can correspond to one of 256 possible distinct symbols, letters, or instructions which can be encoded with the bytes. Raw data (e.g., text files and computer instructions) can be represented as strings of bytes or byte streams. Zip files, or compressed data files comprising raw data, can also be stored in byte streams; these files can be stored as byte streams in a compressed form, and then decompressed into raw data before being read by the computer.
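By way of illustration only, the relationship between bits, bytes, and byte-values may be sketched in Python as follows. The function names byte_value_count and to_bit_string are hypothetical helpers introduced for this example, and the two-character input is an arbitrary sample.

```python
def byte_value_count(n_bits: int) -> int:
    """Number of unique values representable by a string of n_bits bits."""
    return 2 ** n_bits


def to_bit_string(data: bytes) -> str:
    """Render raw bytes as a string of '0' and '1' symbols, 8 bits per byte."""
    return "".join(format(byte, "08b") for byte in data)


# Arbitrary sample: the ASCII characters 'H' (0x48) and 'i' (0x69).
bits = to_bit_string(b"Hi")
```

A bit string produced in this way is one possible form of the string of symbols that may subsequently be mapped to identifiers.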

Methods and systems of the present disclosure may be used to encode computer data or information in a plurality of identifiers, each of which may represent one or more bits of the original information. In some examples, methods and systems of the present disclosure encode data or information using identifiers that each represent two bits of the original information.
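One way to picture identifiers that each represent two bits is sketched below in Python. This is an assumption-laden illustration, not the disclosed implementation: each identifier is modeled abstractly as a (position, value) pair, loosely following the rank object and byte-value object of FIG. 2A, and the function names are hypothetical.

```python
def encode_two_bit_words(bits: str) -> list:
    """Split a bit string into 2-bit words, one abstract identifier per word.

    Each identifier is modeled as (position, value): the position of the word
    in the string and the 2-bit value held at that position.
    """
    if len(bits) % 2 != 0:
        raise ValueError("bit string length must be a multiple of 2")
    return [(i // 2, bits[i:i + 2]) for i in range(0, len(bits), 2)]


def decode_two_bit_words(identifiers) -> str:
    """Rebuild the bit string by ordering identifiers by position."""
    return "".join(value for _, value in sorted(identifiers))


pairs = encode_two_bit_words("011011")
restored = decode_two_bit_words(reversed(pairs))
```

Because each pair carries its own position, the order in which identifiers are read back does not affect the decoded string in this sketch.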

Previous methods for encoding digital information into nucleic acids have relied on base-by-base synthesis of the nucleic acids, which can be costly and time consuming. Alternative methods may improve the efficiency and commercial viability of digital information storage by reducing the reliance on base-by-base nucleic acid synthesis for encoding digital information, and may eliminate the de novo synthesis of distinct nucleic acid sequences for every new information storage request.

New methods can encode digital information (e.g., binary code) in a plurality of identifiers, or nucleic acid sequences, comprising combinatorial arrangements of components instead of relying on base-by-base or de-novo nucleic acid synthesis (e.g., phosphoramidite synthesis). As such, new strategies may produce a first set of distinct nucleic acid sequences (or components) for the first request of information storage, and can thereafter re-use the same nucleic acid sequences (or components) for subsequent information storage requests. These approaches can significantly reduce the cost of DNA-based information storage by reducing the role of de-novo synthesis of nucleic acid sequences in the information-to-DNA encoding and writing process. Moreover, unlike implementations of base-by-base synthesis, such as phosphoramidite chemistry- or template-free polymerase-based nucleic acid elongation, which may use cyclical delivery of each base to each elongating nucleic acid, new methods of information-to-DNA writing using identifier construction from components are highly parallelizable processes that do not necessarily use cyclical nucleic acid elongation. Thus, new methods may increase the speed of writing digital information to DNA compared to older methods.

Methods for Encoding and Writing Information to Nucleic Acid Sequence(s)

In an aspect, the present disclosure provides methods for encoding information into nucleic acid sequences. A method for encoding information into nucleic acid sequences may comprise (a) translating the information into a string of symbols, (b) mapping the string of symbols to a plurality of identifiers, and (c) constructing an identifier library comprising at least a subset of the plurality of identifiers. An individual identifier of the plurality of identifiers may comprise one or more components. An individual component of the one or more components may comprise a nucleic acid sequence. Each symbol at each position in the string of symbols may correspond to a distinct identifier. The individual identifier may correspond to an individual symbol at an individual position in the string of symbols. Moreover, one symbol at each position in the string of symbols may correspond to the absence of an identifier. For example, in a string of binary symbols (e.g., bits) of ‘0’s and ‘1’s, each occurrence of ‘0’ may correspond to the absence of an identifier.

In another aspect, the present disclosure provides methods for nucleic acid-based computer data storage. A method for nucleic acid-based computer data storage may comprise (a) receiving computer data, (b) synthesizing nucleic acid molecules comprising nucleic acid sequences encoding the computer data, and (c) storing the nucleic acid molecules having the nucleic acid sequences. The computer data may be encoded in at least a subset of nucleic acid molecules synthesized and not in a sequence of each of the nucleic acid molecules.

In another aspect, the present disclosure provides methods for writing and storing information in nucleic acid sequences. The method may comprise (a) receiving or encoding a virtual identifier library that represents information, (b) physically constructing the identifier library, and (c) storing one or more physical copies of the identifier library in one or more separate locations. An individual identifier of the identifier library may comprise one or more components. An individual component of the one or more components may comprise a nucleic acid sequence.

In another aspect, the present disclosure provides methods for nucleic acid-based computer data storage. A method for nucleic acid-based computer data storage may comprise (a) receiving computer data, (b) synthesizing a nucleic acid molecule comprising at least one nucleic acid sequence encoding the computer data, and (c) storing the nucleic acid molecule comprising the at least one nucleic acid sequence. Synthesizing the nucleic acid molecule may be in the absence of base-by-base nucleic acid synthesis.

In another aspect, the present disclosure provides methods for writing and storing information in nucleic acid sequences. A method for writing and storing information in nucleic acid sequences may comprise (a) receiving or encoding a virtual identifier library that represents information, (b) physically constructing the identifier library, and (c) storing one or more physical copies of the identifier library in one or more separate locations. An individual identifier of the identifier library may comprise one or more components. An individual component of the one or more components may comprise a nucleic acid sequence.

FIG. 1 illustrates an overview process for encoding information into nucleic acid sequences, writing information to the nucleic acid sequences, reading information written to nucleic acid sequences, and decoding the read information. Digital information, or data, may be translated into one or more strings of symbols. In an example, the symbols are bits and each bit may have a value of either ‘0’ or ‘1’. Each symbol may be mapped, or encoded, to an object (e.g., identifier) representing that symbol. Each symbol may be represented by a distinct identifier. The distinct identifier may be a nucleic acid molecule made up of components. The components may be nucleic acid sequences. The digital information may be written into nucleic acid sequences by generating an identifier library corresponding to the information. The identifier library may be physically generated by physically constructing the identifiers that correspond to each symbol of the digital information. All or any portion of the digital information may be accessed at a time. In an example, a subset of identifiers is accessed from an identifier library. The subset of identifiers may be read by sequencing and identifying the identifiers. The identified identifiers may be associated with their corresponding symbol to decode the digital data.

A method for encoding and reading information using the approach of FIG. 1 can, for example, include receiving a bit stream and mapping each one-bit (bit with bit-value of ‘1’) in the bit stream to a distinct nucleic acid identifier using an identifier rank or a nucleic acid index. The method can further include constructing a nucleic acid sample pool, or identifier library, comprising copies of the identifiers that correspond to bit values of 1 (and excluding identifiers for bit values of 0). Reading the sample can comprise using molecular biology methods (e.g., sequencing, hybridization, PCR, etc.), determining which identifiers are represented in the identifier library, and assigning bit-values of ‘1’ to the bits corresponding to those identifiers and bit-values of ‘0’ elsewhere (again referring to the identifier rank to identify the bits in the original bit-stream that each identifier corresponds to), thus decoding the information into the original encoded bit stream.
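By way of illustration only, the presence/absence mapping described above can be sketched in software. In this minimal sketch, an identifier is modeled simply by its integer rank, and the names encode and decode are hypothetical; physical construction and sequencing of the identifiers are outside the scope of the sketch.

```python
def encode(bits):
    """Return the ranks of the identifiers to construct: positions whose bit-value is '1'."""
    return {rank for rank, bit in enumerate(bits) if bit == "1"}


def decode(present_ranks, length):
    """Assign '1' to every rank found in the library and '0' elsewhere."""
    return "".join("1" if rank in present_ranks else "0" for rank in range(length))


bit_stream = "0110100010"
library = encode(bit_stream)               # identifiers physically constructed
recovered = decode(library, len(bit_stream))
```

In this sketch the length of the original bit stream is supplied to the decoder, because absent identifiers carry no record of how many trailing ‘0’ bits were encoded.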

Encoding a string of N distinct bits can use an equivalent number of unique nucleic acid sequences as possible identifiers. This approach to information encoding may use de-novo synthesis of identifiers (e.g., nucleic acid molecules) for each new item of information (string of N bits) to store. In other instances, the cost of newly synthesizing identifiers (equivalent in number to or less than N) for each new item of information to store can be reduced by the one-time de-novo synthesis and subsequent maintenance of all possible identifiers, such that encoding new items of information may involve mechanically selecting and mixing together pre-synthesized (or pre-fabricated) identifiers to form an identifier library. In other instances, the cost of (1) de-novo synthesis of up to N identifiers for each new item of information to store, (2) maintaining and selecting from N possible identifiers for each new item of information to store, or any combination thereof, may be reduced by synthesizing and maintaining a number (less than N, and in some cases much less than N) of nucleic acid sequences and then modifying these sequences through enzymatic reactions to generate up to N identifiers for each new item of information to store.

The identifiers may be rationally designed and selected for ease of read, write, access, copy, and deletion operations. The identifiers may be designed and selected to minimize write errors, mutations, degradation, and read errors. See Chemical Methods Section H on the rational design of DNA sequences that comprise synthetic nucleic acid libraries (such as identifier libraries).

FIGS. 2A and 2B schematically illustrate an example method, referred to as “data at address”, of encoding digital data in objects or identifiers (e.g., nucleic acid molecules). FIG. 2A illustrates encoding a bit stream into an identifier library wherein the individual identifiers are constructed by concatenating or assembling a single component that specifies an identifier rank with a single component that specifies a byte-value. In general, the data at address method uses identifiers that encode information modularly by comprising two objects: one object, the “byte-value object” (or “data object”), that identifies a byte-value and one object, the “rank object” (or “address object”), that identifies the identifier rank (or the relative position of the byte in the original bit-stream). FIG. 2B illustrates an example of the data at address method wherein each rank object may be combinatorially constructed from a set of components and each byte-value object may be combinatorially constructed from a set of components. Such combinatorial construction of rank and byte-value objects enables more information to be written into identifiers than if the objects were made from the single components alone (e.g., FIG. 2A).

FIGS. 3A and 3B schematically illustrate another example method of encoding digital information in objects or identifiers (e.g., nucleic acid sequences). FIG. 3A illustrates encoding a bit stream into an identifier library wherein identifiers are constructed from single components that specify identifier rank. The presence of an identifier at a particular rank (or address) specifies a bit-value of ‘1’ and the absence of an identifier at a particular rank (or address) specifies a bit-value of ‘0’. This type of encoding may use identifiers that solely encode rank (the relative position of a bit in the original bit stream) and use the presence or absence of those identifiers in an identifier library to encode a bit-value of ‘1’ or ‘0’, respectively. Reading and decoding the information may include identifying the identifiers present in the identifier library, assigning bit-values of ‘1’ to their corresponding ranks and assigning bit-values of ‘0’ elsewhere. FIG. 3B illustrates an example encoding method where each identifier may be combinatorially constructed from a set of components such that each possible combinatorial construction specifies a rank. Such combinatorial construction enables more information to be written into identifiers than if the identifiers were made from the single components alone (e.g., FIG. 3A). For example, a component set may comprise five distinct components. The five distinct components may be assembled to generate ten distinct identifiers, each comprising two of the five components. The ten distinct identifiers may each have a rank (or address) that corresponds to the position of a bit in a bit stream. An identifier library may include the subset of those ten possible identifiers that corresponds to the positions of bit-value ‘1’, and exclude the subset of those ten possible identifiers that corresponds to the positions of the bit-value ‘0’ within a bit stream of length ten.
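The five-component example above can be sketched as follows. This is a minimal illustration only: the component names c0 through c4 and the lexicographic ordering of the pairs are assumptions made for the sketch, not a ranking prescribed by the disclosure.

```python
from itertools import combinations

# Hypothetical names for the five distinct components.
components = ["c0", "c1", "c2", "c3", "c4"]

# Every unordered pair of components is one possible identifier (5 choose 2 = 10).
identifiers = list(combinations(components, 2))
rank_of = {identifier: rank for rank, identifier in enumerate(identifiers)}


def build_library(bits):
    """Keep only the identifiers whose rank holds a bit-value of '1'."""
    return [identifiers[rank] for rank, bit in enumerate(bits) if bit == "1"]


def read_library(library):
    """Recover the ten-bit stream from the identifiers found in the library."""
    present = {rank_of[identifier] for identifier in library}
    return "".join("1" if rank in present else "0" for rank in range(len(identifiers)))


library = build_library("1000000001")
```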

FIG. 4 shows a contour plot, in log space, of a relationship between the combinatorial space of possible identifiers (C, x-axis) and the average number of identifiers (k, y-axis) to be physically constructed in order to store information of a given original size in bits (D, contour lines) using the encoding method shown in FIGS. 3A and 3B. This plot assumes that the original information of size D is re-coded into a string of C bits (where C may be greater than D) where a number of bits, k, has a bit-value of ‘1’. Moreover, the plot assumes that information-to-nucleic-acid encoding is performed on the re-coded bit string and that identifiers for positions where the bit-value is ‘1’ are constructed and identifiers for positions where the bit-value is ‘0’ are not constructed. Following the assumptions, the combinatorial space of possible identifiers has size C to identify every position in the re-coded bit string, and the number of identifiers used to encode the bit string of size D is such that D=log₂(C choose k), where “C choose k” may be the mathematical formula for the number of ways to pick k unordered outcomes from C possibilities, i.e., C!/(k!(C−k)!). Thus, as the combinatorial space of possible identifiers increases beyond the size (in bits) of a given item of information, a decreasing number of physically constructed identifiers may be used to store the given information.
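The relationship D=log₂(C choose k) can be explored numerically. The sketch below is illustrative only and assumes that every re-coded string has exactly k bits of value ‘1’ (the plot refers to an average k); the function names are hypothetical.

```python
import math


def capacity_bits(C, k):
    """Bits representable by choosing exactly k identifiers out of C: log2(C choose k)."""
    return math.log2(math.comb(C, k))


def min_identifiers(D, C):
    """Smallest k (up to C // 2) with (C choose k) >= 2**D, or None if C is too small."""
    for k in range(C // 2 + 1):
        if math.comb(C, k) >= 2 ** D:   # integer comparison avoids rounding issues
            return k
    return None


# For D = 8 bits, the required k falls as the combinatorial space C grows.
trend = {C: min_identifiers(8, C) for C in (10, 11, 16, 32, 256)}
```

Under these assumptions, one byte cannot be stored with C=10, needs four identifiers with C=11, and needs only a single identifier once C reaches 256.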

FIG. 5 shows an overview method for writing information into nucleic acid sequences. Prior to writing the information, the information may be translated into a string of symbols and encoded into a plurality of identifiers. Writing the information may include setting up reactions to produce possible identifiers. A reaction may be set up by depositing inputs into a compartment. The inputs may comprise nucleic acids, components, templates, enzymes, or chemical reagents. The compartment may be a well, a tube, a position on a surface, a chamber in a microfluidic device, or a droplet within an emulsion. Multiple reactions may be set up in multiple compartments. Reactions may proceed to produce identifiers through programmed temperature incubation or cycling. Reactions may be selectively or ubiquitously removed (e.g., deleted). Reactions may also be selectively or ubiquitously interrupted, consolidated, and purified to collect their identifiers in one pool. Identifiers from multiple identifier libraries may be collected in the same pool. An individual identifier may include a barcode or a tag to identify to which identifier library it belongs. Alternatively, or in addition, the barcode may include metadata for the encoded information. Supplemental nucleic acids or identifiers may also be included in an identifier pool together with an identifier library. The supplemental nucleic acids or identifiers may include metadata for the encoded information or serve to obfuscate or conceal the encoded information.

An identifier rank (e.g., nucleic acid index) can comprise a method or key for determining the ordering of identifiers. The method can comprise a look-up table with all identifiers and their corresponding rank. The method can also comprise a look-up table with the rank of all components that constitute identifiers and a function for determining the ordering of any identifier comprising a combination of those components. Such a method may be referred to as lexicographical ordering and may be analogous to the manner in which words in a dictionary are alphabetically ordered. In the data at address encoding method, the identifier rank (encoded by the rank object of the identifier) may be used to determine the position of a byte (encoded by the byte-value object of the identifier) within a bit stream. In an alternative method, the identifier rank (encoded by the entire identifier itself) for a present identifier may be used to determine the position of a bit-value of ‘1’ within a bit stream.
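One possible form of the ordering function described above is sketched below. The sketch assumes identifiers built from one component per layer in a fixed order and treats the per-layer component ranks as digits of a mixed-radix number; this particular function and its names are illustrative assumptions rather than the ordering required by the disclosure.

```python
def identifier_rank(component_ranks, layer_sizes):
    """Lexicographic rank of an identifier from the ranks of its components, first layer most significant."""
    rank = 0
    for component_rank, size in zip(component_ranks, layer_sizes):
        if not 0 <= component_rank < size:
            raise ValueError("component rank outside its layer")
        rank = rank * size + component_rank
    return rank


def identifier_from_rank(rank, layer_sizes):
    """Inverse mapping: recover the per-layer component ranks from an identifier rank."""
    component_ranks = []
    for size in reversed(layer_sizes):
        rank, component_rank = divmod(rank, size)
        component_ranks.append(component_rank)
    return tuple(reversed(component_ranks))


example_rank = identifier_rank((2, 0, 1), (3, 3, 3))
```

With such a function, only the component ranks need to be tabulated, rather than one table entry per identifier.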

A key may assign distinct bytes to unique subsets of identifiers (e.g., nucleic acid molecules) within a sample. For example, in a simple form, a key may assign each bit in a byte to a unique nucleic acid sequence that specifies the position of the bit, and then the presence or absence of that nucleic acid sequence within a sample may specify the bit-value of 1 or 0, respectively. Reading the encoded information from the nucleic acid sample can comprise any number of molecular biology techniques including sequencing, hybridization, or PCR. In some embodiments, reading the encoded dataset may comprise reconstructing a portion of the dataset or reconstructing the entire encoded dataset from each nucleic acid sample. When the sequences are read, the nucleic acid index can be used along with the presence or absence of each unique nucleic acid sequence to decode the nucleic acid sample into a bit stream (e.g., each string of bits, byte, bytes, or string of bytes).

Identifiers may be constructed by combinatorially assembling component nucleic acid sequences. For example, information may be encoded by taking a set of nucleic acid molecules (e.g., identifiers) from a defined group of molecules (e.g., combinatorial space). Each possible identifier of the defined group of molecules may be an assembly of nucleic acid sequences (e.g., components) from a prefabricated set of components that may be divided into layers. Each individual identifier may be constructed by concatenating one component from every layer in a fixed order. For example, if there are M layers and each layer may have n components, then up to C=n^(M) unique identifiers may be constructed and up to 2^(C) different items of information, or C bits, may be encoded and stored. For example, storage of a megabit of information may use 1×10⁶ distinct identifiers or a combinatorial space of size C=1×10⁶. The identifiers in this example may be assembled from a variety of components organized in different ways. Assemblies may be made from M=2 prefabricated layers, each containing n=1×10³ components. Alternatively, assemblies may be made from M=3 layers, each containing n=1×10² components. As this example illustrates, encoding the same amount of information using a larger number of layers may allow for the total number of components to be smaller. Using a smaller number of total components may be advantageous in terms of writing cost.
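The trade-off between the number of layers and the total number of prefabricated components can be checked with simple arithmetic. The sketch below is illustrative; the function name design is hypothetical, and the six-layer case is an additional data point that does not appear in the example above.

```python
def design(n, M):
    """Return (combinatorial space C = n**M, total prefabricated components n*M)."""
    return n ** M, n * M


two_layers = design(1000, 2)    # M=2 layers of 1x10^3 components
three_layers = design(100, 3)   # M=3 layers of 1x10^2 components
six_layers = design(10, 6)      # further illustration, not taken from the text
```

Each design spans a combinatorial space of 1×10⁶ identifiers, while the total number of components to maintain falls from 2000 to 300 to 60.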

In an example, one can start with two sets of unique nucleic acid sequences or layers, X and Y, each with x and y components (e.g., nucleic acid sequences), respectively. Each nucleic acid sequence from X can be assembled to each nucleic acid sequence from Y. Though the total number of nucleic acid sequences maintained in the two sets may be the sum of x and y, the total number of nucleic acid molecules, and hence possible identifiers, that can be generated may be the product of x and y. Even more nucleic acid sequences (e.g., identifiers) can be generated if the sequences from X can be assembled to the sequences of Y in any order. For example, the number of nucleic acid sequences (e.g., identifiers) generated may be twice the product of x and y if the assembly order is programmable. This set of all possible nucleic acid sequences that can be generated may be referred to as XY. The order of the assembled units of unique nucleic acid sequences in XY can be controlled using nucleic acids with distinct 5′ and 3′ ends, and restriction digestion, ligation, polymerase chain reaction (PCR), and sequencing may occur with respect to the distinct 5′ and 3′ ends of the sequences. Such an approach can reduce the total number of nucleic acid sequences (e.g., components) used to encode N distinct bits, by encoding information in the combinations and orders of their assembly products. For example, to encode 100 bits of information, two layers of 10 distinct nucleic acid molecules (e.g., components) may be assembled in a fixed order to produce 10*10 or 100 distinct nucleic acid molecules (e.g., identifiers), or one layer of 5 distinct nucleic acid molecules (e.g., components) and another layer of 10 distinct nucleic acid molecules (e.g., components) may be assembled in any order to produce 100 distinct nucleic acid molecules (e.g., identifiers).
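The two ways of reaching 100 identifiers in the example above can be enumerated directly. The sketch is illustrative; component names such as X0 and Y0 and the hyphenated string representation of an assembled molecule are assumptions.

```python
from itertools import product


def fixed_order(X, Y):
    """Identifiers with a component of X followed by a component of Y (x*y products)."""
    return [x + "-" + y for x, y in product(X, Y)]


def any_order(X, Y):
    """Identifiers with programmable order: X then Y, or Y then X (2*x*y products)."""
    return fixed_order(X, Y) + [y + "-" + x for x, y in product(X, Y)]


ten_by_ten = fixed_order([f"X{i}" for i in range(10)], [f"Y{j}" for j in range(10)])
five_by_ten = any_order([f"X{i}" for i in range(5)], [f"Y{j}" for j in range(10)])
```

The first design maintains 20 components and the second maintains 15, yet both yield 100 distinct identifiers.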

Nucleic acid sequences (e.g., components) within each layer may comprise a unique (or distinct) sequence, or barcode, in the middle, a common hybridization region on one end, and another common hybridization region on the other end. The barcode may contain a sufficient number of nucleotides to uniquely identify every sequence within the layer. For example, there are typically four possible nucleotides for each base position within a barcode. Therefore, a three-base barcode may uniquely identify 4³=64 nucleic acid sequences. The barcodes may be designed to be randomly generated. Alternatively, the barcodes may be designed to avoid sequences that may create complications to the construction chemistry of identifiers or sequencing. Additionally, barcodes may be designed so that each may have a minimum Hamming distance from the other barcodes, thereby decreasing the likelihood that base-resolution mutations or read errors may interfere with the proper identification of the barcode. See Chemical Methods Section H on the rational design of DNA sequences.
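A minimum Hamming distance between barcodes can be enforced in many ways. The following sketch uses a simple greedy selection over all candidate barcodes; the greedy strategy and the function names are assumptions made for illustration, and the sketch does not screen for sequences that complicate construction chemistry or sequencing.

```python
from itertools import product


def hamming(a, b):
    """Number of base positions at which two equal-length barcodes differ."""
    return sum(x != y for x, y in zip(a, b))


def pick_barcodes(length, min_distance):
    """Greedily keep each candidate that is at least min_distance from all kept barcodes."""
    chosen = []
    for candidate in ("".join(bases) for bases in product("ACGT", repeat=length)):
        if all(hamming(candidate, kept) >= min_distance for kept in chosen):
            chosen.append(candidate)
    return chosen


all_three_base = pick_barcodes(3, 1)   # every three-base barcode
spaced = pick_barcodes(3, 2)           # any single-base error yields a non-barcode
```

Requiring a larger minimum distance shrinks the usable barcode set, so a layer with many components may call for longer barcodes.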

The hybridization region on one end of the nucleic acid sequence (e.g., component) may be different in each layer, but the hybridization region may be the same for each member within a layer. Adjacent layers are those that have complementary hybridization regions on their components that allow them to interact with one another. For example, any component from layer X may be able to attach to any component from layer Y because they may have complementary hybridization regions. The hybridization region on the opposite end may serve the same purpose as the hybridization region on the first end. For example, any component from layer Y may attach to any component of layer X on one end and any component of layer Z on the opposite end.

FIGS. 6A and 6B illustrate an example method, referred to as the “product scheme”, for constructing identifiers (e.g., nucleic acid molecules) by combinatorially assembling a distinct component (e.g., nucleic acid sequence) from each layer in a fixed order. FIG. 6A illustrates the architecture of identifiers constructed using the product scheme. An identifier may be constructed by combining a single component from each layer in a fixed order. For M layers, each with N components, there are N^(M) possible identifiers. FIG. 6B illustrates an example of the combinatorial space of identifiers that may be constructed using the product scheme. In an example, a combinatorial space may be generated from three layers each comprising three distinct components. The components may be combined such that one component from each layer may be combined in a fixed order. The entire combinatorial space for this assembly method may comprise twenty-seven possible identifiers.
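The twenty-seven-member combinatorial space of the three-layer example can be enumerated as a Cartesian product. The sketch is illustrative; the component names and the string concatenation used to represent an assembled identifier are assumptions.

```python
from itertools import product

# Three layers, each with three distinct components (hypothetical names).
layers = [
    ["X1", "X2", "X3"],
    ["Y1", "Y2", "Y3"],
    ["Z1", "Z2", "Z3"],
]

# One component from each layer, always in the order X, Y, Z.
space = ["".join(combo) for combo in product(*layers)]
```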

FIGS. 7-10 illustrate chemical methods for implementing the product scheme (see FIG. 6). Methods depicted in FIGS. 7-10, along with any other methods for assembling two or more distinct components in a fixed order may be used, for example, to produce any one or more identifiers in an identifier library. Identifiers may be constructed using any of the implementation methods described in FIGS. 7-10, at any time during the methods or systems disclosed herein. In some instances, all or a portion of the combinatorial space of possible identifiers may be constructed before digital information is encoded or written, and then the writing process may involve mechanically selecting and pooling the identifiers (that encode the information) from the already existing set. In other instances, the identifiers may be constructed after one or more steps of the data encoding or writing process may have occurred (i.e., as information is being written).

Enzymatic reactions may be used to assemble components from the different layers or sets. Assembly can occur in a one pot reaction because components (e.g., nucleic acid sequences) of each layer have specific hybridization or attachment regions for components of adjacent layers. For example, a nucleic acid sequence (e.g., component) X1 from layer X, a nucleic acid sequence Y1 from layer Y, and a nucleic acid sequence Z1 from layer Z may form the assembled nucleic acid molecule (e.g., identifier) X1Y1Z1. Additionally, multiple nucleic acid molecules (e.g., identifiers) may be assembled in one reaction by including multiple nucleic acid sequences from each layer. For example, including both Y1 and Y2 in the one pot reaction of the previous example may yield two assembled products (e.g., identifiers), X1Y1Z1 and X1Y2Z1. This reaction multiplexing may be used to speed up writing time for the plurality of identifiers that are physically constructed. See Chemical Methods Section H for detail about the rational design of DNA sequences as it pertains to assembly efficiency. Assembly of the nucleic acid sequences may be performed in a time period that is less than or equal to about 1 day, 12 hours, 10 hours, 9 hours, 8 hours, 7 hours, 6 hours, 5 hours, 4 hours, 3 hours, 2 hours, or 1 hour. The accuracy of the encoded data may be at least about or equal to about 90%, 95%, 96%, 97%, 98%, 99%, or greater.
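Reaction multiplexing can be modeled, in an idealized way, as a Cartesian product over the components placed in the reaction. The sketch below assumes every component of a layer joins every component of the adjacent layer with equal efficiency and ignores byproducts; the function name one_pot is hypothetical.

```python
from itertools import product


def one_pot(*layer_inputs):
    """Idealized products of one reaction: one component per layer, in layer order."""
    return ["".join(combo) for combo in product(*layer_inputs)]


two_products = one_pot(["X1"], ["Y1", "Y2"], ["Z1"])
four_products = one_pot(["X1", "X2"], ["Y1", "Y2"], ["Z1"])
```

Under this idealized model, a single reaction yields the full cross product of its inputs, so a target set such as X1Y1Z1 and X2Y2Z1 without X1Y2Z1 and X2Y1Z1 would be split across separate reactions.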

Identifiers may be constructed in accordance with the product scheme using overlap extension polymerase chain reaction (OEPCR), as illustrated in FIG. 7. Each component in each layer may comprise a double-stranded or single-stranded (as depicted in the figure) nucleic acid sequence with a common hybridization region on the sequence end that may be homologous and/or complementary to the common hybridization region on the sequence end of components from an adjacent layer. An individual identifier may be constructed by concatenating one component (e.g., unique sequence) from a layer X (or layer 1) comprising components X₁-X_(A), a second component (e.g., unique sequence) from a layer Y (or layer 2) comprising Y₁-Y_(A), and a third component (e.g., unique sequence) from layer Z (or layer 3) comprising Z₁-Z_(B). The components from layer X may have a 3′ end that shares complementarity with the 3′ end on components from layer Y. Thus single-stranded components from layer X and Y may be annealed together at the 3′ end and may be extended using PCR to generate a double-stranded nucleic acid molecule. The generated double-stranded nucleic acid molecule may be melted to generate a 3′ end that shares complementarity with a 3′ end of a component from layer Z. A component from layer Z may be annealed with the generated nucleic acid molecule and may be extended to generate a unique identifier comprising a single component from layers X, Y, and Z in a fixed order. See Chemical Methods Section A about OEPCR. DNA size selection (e.g., with gel extraction, see Chemical Methods Section E) or polymerase chain reaction (PCR) with primers flanking the outermost layers (see Chemical Methods Section D) may be implemented to isolate fully assembled identifier products from other byproducts that may form in the reaction. Sequential nucleic acid capture with two probes, one for each of the two outermost layers, may also be implemented to isolate fully assembled identifier products from other byproducts that may form in the reaction (see Chemical Methods Section F).

Identifiers may be assembled in accordance with the product scheme using sticky end ligation, as illustrated in FIG. 8. Three layers, each comprising double-stranded components (e.g., double-stranded DNA (dsDNA)) with single-stranded 3′ overhangs, can be used to assemble distinct identifiers. For example, identifiers may comprise one component from the layer X (or layer 1) comprising components X₁-X_(A), a second component from the layer Y (or layer 2) comprising Y₁-Y_(B), and a third component from the layer Z (or layer 3) comprising Z₁-Z_(C). To combine components from layer X with components from layer Y, the components in layer X can comprise a common 3′ overhang, labeled a in FIG. 8, and the components in layer Y can comprise a common, complementary 3′ overhang, a*. To combine components from layer Y with components from layer Z, the components in layer Y can comprise a common 3′ overhang, labeled b in FIG. 8, and the components in layer Z can comprise a common, complementary 3′ overhang, b*. The 3′ overhang in layer X components can be complementary to the 3′ end in layer Y components and the other 3′ overhang in layer Y components can be complementary to the 3′ end in layer Z components, allowing the components to hybridize and ligate. As such, components from layer X cannot hybridize with other components from layer X or layer Z, and similarly components from layer Y cannot hybridize with other components from layer Y. Furthermore, a single component from layer Y can ligate to a single component of layer X and a single component of layer Z, ensuring the formation of a complete identifier. See Chemical Methods Section B about sticky end ligation. DNA size selection (e.g., with gel extraction, see Chemical Methods Section E) or polymerase chain reaction (PCR) with primers flanking the outermost layers (see Chemical Methods Section D) may be implemented to isolate identifier products from other byproducts that may form in the reaction. Sequential nucleic acid capture with two probes, one for each of the two outermost layers, may also be implemented to isolate identifier products from other byproducts that may form in the reaction (see Chemical Methods Section F).

The sticky ends for sticky end ligation may be generated by treating the components of each layer with restriction endonucleases (see Chemical Methods Section C for more information about restriction enzyme reactions). In some embodiments, the components of multiple layers may be generated from one “parent” set of components. For example, in one embodiment, a single parent set of double-stranded components may have complementary restriction sites on each end (e.g., restriction sites for BamHI and BglII). Any two components may be selected for assembly, and individually digested with one or the other complementary restriction enzymes (e.g., BglII or BamHI), resulting in complementary sticky ends that can be ligated together, resulting in an inert scar. The product nucleic acid sequence may comprise the complementary restriction sites on each end (e.g., BamHI on the 5′ end and BglII on the 3′ end), and can be further ligated to another component from the parent set following the same process. This process may cycle indefinitely (FIG. 20). If the parent set comprises N components, then each cycle may be equivalent to adding an extra layer of N components to the product scheme.
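The cyclic assembly described above can be sketched as a top-strand string model. The sketch assumes the standard recognition sequences of BamHI (G^GATCC) and BglII (A^GATCT), the enzyme the text pairs with BamHI, models only the top strand, and uses hypothetical payload sequences that contain neither site; it is not a model of digestion or ligation efficiency.

```python
BAMHI = "GGATCC"  # BamHI cuts G^GATCC
BGLII = "AGATCT"  # BglII cuts A^GATCT; both leave a GATC overhang


def make_component(payload):
    """Parent-set component: BamHI site on the 5' end, BglII site on the 3' end."""
    assert BAMHI not in payload and BGLII not in payload
    return BAMHI + payload + BGLII


def join(left, right):
    """Digest left with BglII and right with BamHI, then ligate the compatible ends."""
    assert left.endswith(BGLII) and right.startswith(BAMHI)
    left_fragment = left[:-5]    # top strand keeps the A of A^GATCT
    right_fragment = right[1:]   # top strand keeps GATCC of G^GATCC
    return left_fragment + right_fragment


a = make_component("TTACGCAT")   # hypothetical payloads
b = make_component("CCGTAAGC")
c = make_component("ATTGCAGT")
ab = join(a, b)
abc = join(ab, c)                # the product re-enters the same cycle
```

In this model each junction reads AGATCC, which neither enzyme recognizes, while the product keeps one BamHI site and one BglII site at its ends and can therefore be extended again.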

A method for using ligation to construct a sequence of nucleic acids comprising elements from set X (e.g., set 1 of dsDNA) and elements from set Y (e.g., set 2 of dsDNA) can comprise the steps of obtaining or constructing two or more pools (e.g., set 1 of dsDNA and set 2 of dsDNA) of double-stranded sequences wherein a first set (e.g., set 1 of dsDNA) comprises a sticky end (e.g., a) and a second set (e.g., set 2 of dsDNA) comprises a sticky end (e.g., a*) that is complementary to the sticky end of the first set. Any DNA from the first set (e.g., set 1 of dsDNA) and any subset of DNA from the second set (e.g., set 2 of dsDNA) can be combined and assembled and then ligated together to form a single double-stranded DNA with an element from the first set and an element from the second set.

Identifiers may be assembled in accordance with the product scheme using site-specific recombination, as illustrated in FIG. 9. Identifiers may be constructed by assembling components from three different layers. The components in layer X (or layer 1) may comprise double-stranded molecules with an attB_(x) recombinase site on one side of the molecule, components from layer Y (or layer 2) may comprise double-stranded molecules with an attP_(x) recombinase site on one side and an attB_(y) recombinase site on the other side, and components in layer Z (or layer 3) may comprise an attP_(y) recombinase site on one side of the molecule. attB and attP sites within a pair, as indicated by their subscripts, are capable of recombining in the presence of their corresponding recombinase enzyme. One component from each layer may be combined such that one component from layer X associates with one component from layer Y, and one component from layer Y associates with one component from layer Z. Application of one or more recombinase enzymes may recombine the components to generate a double-stranded identifier comprising the ordered components. DNA size selection (for example with gel extraction) or PCR with primers flanking the outermost layers may be implemented to isolate identifier products from other byproducts that may form in the reaction. In general, multiple orthogonal attB and attP pairs may be used, and each pair may be used to assemble a component from an extra layer. For the large-serine family of recombinases, up to six orthogonal attB and attP pairs may be generated per recombinase, and multiple orthogonal recombinases may be implemented as well. For example, thirteen layers may be assembled by using twelve orthogonal attB and attP pairs, six orthogonal pairs from each of two large serine recombinases, such as BxbI and PhiC31. Orthogonality of attB and attP pairs ensures that an attB site from one pair does not react with an attP site from another pair. This enables components from different layers to be assembled in a fixed order. Recombinase-mediated recombination reactions may be reversible or irreversible depending on the recombinase system implemented. For example, the large serine recombinase family catalyzes irreversible recombination reactions without requiring any high energy cofactors, whereas the tyrosine recombinase family catalyzes reversible reactions.

Identifiers may be constructed in accordance with the product scheme using template directed ligation (TDL), as shown in FIG. 10A. Template directed ligation utilizes single-stranded nucleic acid sequences, referred to as “templates” or “staples”, to facilitate the ordered ligation of components to form identifiers. The templates simultaneously hybridize to components from adjacent layers and hold them adjacent to each other (3′ end against 5′ end) while a ligase ligates them. In the example from FIG. 10A, three layers or sets of single-stranded components are combined with two templates: a first layer of components (e.g., layer X or layer 1) that share common sequences a on their 3′ end, which are complementary to sequences a*; a second layer of components (e.g., layer Y or layer 2) that share common sequences b and c on their 5′ and 3′ ends respectively, which are complementary to sequences b* and c*; a third layer of components (e.g., layer Z or layer 3) that share common sequence d on their 5′ end, which may be complementary to sequences d*; and a set of two templates or “staples” with the first staple comprising the sequence a*b* (5′ to 3′) and the second staple comprising a sequence c*d* (5′ to 3′). In this example, one or more components from each layer may be selected and mixed into a reaction with the staples, which, by complementary annealing, may facilitate the ligation of one component from each layer in a defined order to form an identifier. See Chemical Methods Section B about TDL. DNA size selection (e.g., with gel extraction, see Chemical Methods Section E) or polymerase chain reaction (PCR) with primers flanking the outermost layers (see Chemical Methods Section D) may be implemented to isolate identifier products from other byproducts that may form in the reaction. Sequential nucleic acid capture with two probes, one for each of the two outermost layers, may also be implemented to isolate identifier products from other byproducts that may form in the reaction (see Chemical Methods Section F).

FIG. 10B shows a histogram of the copy numbers (abundances) of 256 distinct nucleic acid sequences that were each assembled with 6-layer TDL. The edge layers (first and final layers) each had one component, and each of the internal layers (remaining four layers) had four components. Each edge layer component was 28 bases including a 10 base hybridization region. Each internal layer component was 30 bases including a 10 base common hybridization region on the 5′ end, a 10 base variable (barcode) region, and a 10 base common hybridization region on the 3′ end. Each of the three template strands was 20 bases in length. All 256 distinct sequences were assembled in a multiplex fashion with one reaction containing all of the components and templates, T4 Polynucleotide Kinase (for phosphorylating the components), and T4 Ligase, ATP, and other proper reaction reagents. The reaction was incubated at 37 degrees for 30 minutes and then room temperature for 1 hour. Sequencing adapters were added to the reaction product with PCR, and the product was sequenced with an Illumina MiSeq instrument. The relative copy number of each distinct assembled sequence out of 192910 total assembled sequence reads is shown. Other embodiments of this method may use double-stranded components, where the components are initially melted to form single-stranded versions that can anneal to the staples. Other embodiments or derivatives of this method (i.e., TDL) may be used to construct a combinatorial space of identifiers more complex than what may be accomplished in the product scheme.

Identifiers may be constructed in accordance with the product scheme using various other chemical implementations including Golden Gate assembly, Gibson assembly, and ligase cycling reaction assembly.

FIGS. 11A and 11B schematically illustrate an example method, referred to as the “permutation scheme”, for constructing identifiers (e.g., nucleic acid molecules) with permuted components (e.g., nucleic acid sequences). FIG. 11A illustrates the architecture of identifiers constructed using the permutation scheme. An identifier may be constructed by combining a single component from each layer in a programmable order. FIG. 11B illustrates an example of the combinatorial space of identifiers that may be constructed using the permutation scheme. In an example, a combinatorial space of size six may be generated from three layers each comprising one distinct component. The components may be concatenated in any order. In general, with M layers, each with N components, the permutation scheme enables a combinatorial space of N^(M)·M! total identifiers.
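The size of the permutation-scheme space can be confirmed by enumeration. The sketch below is illustrative; the component names and the string representation of an identifier are assumptions.

```python
from itertools import permutations, product
from math import factorial


def permutation_space(layers):
    """All identifiers using one component from each layer, with the layers in any order."""
    space = []
    for layer_order in permutations(layers):
        for combo in product(*layer_order):
            space.append("".join(combo))
    return space


three_single = permutation_space([["A"], ["B"], ["C"]])      # N=1, M=3
two_double = permutation_space([["A1", "A2"], ["B1", "B2"]])  # N=2, M=2
```

The first call reproduces the six identifiers of the three-layer example, and the second yields 2²·2! = 8 identifiers.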

FIG. 11C illustrates an example implementation of the permutation scheme with template directed ligation (TDL, see Chemical Methods Section B). Components from multiple layers are assembled in between fixed left end and right end components, referred to as edge scaffolds. These edge scaffolds are the same for all identifiers in the combinatorial space and thus may be added as part of the reaction master mix for the implementation. Templates or staples exist for any possible junction between any two layers or scaffolds such that the order in which components from different layers are incorporated into an identifier in the reaction depends on the templates selected for the reaction. In order to enable any possible permutation of layers for M layers, there may be M²+2M distinct selectable staples covering every possible junction (including junctions with the scaffolds). M of those templates (shaded in grey) form junctions between layers and themselves and may be excluded for the purposes of permutation assembly as described herein. However, their inclusion can enable a larger combinatorial space with identifiers comprising repeat components as illustrated in FIGS. 11D-G. DNA size selection (e.g., with gel extraction, see Chemical Methods Section E) or polymerase chain reaction (PCR) with primers flanking the outermost layers (see Chemical Methods Section D) may be implemented to isolate identifier products from other byproducts that may form in the reaction. Sequential nucleic acid capture with two probes, one for each of the two outermost layers, may also be implemented to isolate identifier products from other byproducts that may form in the reaction (see Chemical Methods Section F).
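The M²+2M staple count can be reproduced by listing each junction explicitly. This is an illustrative sketch; the labels `L`, `R`, and `layer1`, `layer2`, ... are hypothetical names for the edge scaffolds and layers, and each staple is represented simply as an ordered (upstream, downstream) pair.

```python
def permutation_staples(num_layers):
    """List every selectable junction as an (upstream, downstream) pair."""
    left, right = "L", "R"  # hypothetical edge scaffold labels
    layers = [f"layer{i}" for i in range(1, num_layers + 1)]
    staples = [(left, layer) for layer in layers]       # M scaffold-to-layer
    staples += [(layer, right) for layer in layers]     # M layer-to-scaffold
    staples += [(a, b) for a in layers for b in layers]  # M*M layer-to-layer
    return staples


def self_junctions(staples):
    """Staples joining a layer to itself (only needed for repeated components)."""
    return [s for s in staples if s[0] == s[1]]


staples = permutation_staples(3)
```

For M=3 this gives 15 staples, of which 3 are layer-to-itself junctions that may be left out when only strict permutations are wanted.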

FIGS. 11D-G illustrate example methods of how the permutation scheme may be expanded to include certain instances of identifiers with repeated components. FIG. 11D shows an example of how the implementation from FIG. 11C may be used to construct identifiers with permuted and repeated components. For example, an identifier may comprise three total components assembled from two distinct components. In this example, a component from a layer may be present multiple times in an identifier. Adjacent concatenations of the same component may be achieved by using a staple with adjacent complementary hybridization regions for both the 3′ end and 5′ end of the same component, such as the a*b* (5′ to 3′) staple in the figure. In general, for M layers, there are M such staples. Incorporation of repeated components with this implementation may generate nucleic acid sequences of more than one length (i.e., comprising one, two, three, four, or more components) that are assembled between the edge scaffolds, as demonstrated in FIG. 11E. FIG. 11E shows how the example implementation from FIG. 11D may lead to non-targeted nucleic acid sequences, besides the identifier, that are assembled between the edge scaffolds. The appropriate identifier cannot be isolated from the non-targeted nucleic acid sequences with PCR because they share the same primer binding sites on the edges. However, in this example, DNA size selection (e.g., with gel extraction) may be implemented to isolate the targeted identifier (e.g., the second sequence from the top) from the non-targeted sequences since each assembled nucleic acid sequence can be designed to have a unique length (e.g., if all components have the same length). See Chemical Methods Section E about size-selection. FIG. 11F shows another example where constructing an identifier with repeated components may generate multiple nucleic acid sequences with equal edge sequences but distinct lengths in the same reaction. 
In this method, templates that assemble components in one layer with components in other layers in an alternating pattern may be used. As with the method shown in FIG. 11E, size selection may be used to select identifiers of the designed length. FIG. 11G shows an example where constructing an identifier with repeated components may generate multiple nucleic acid sequences with equal edge sequences and, for some nucleic acid sequences (e.g., the third and fourth from the top and the sixth and seventh from the top), equal lengths. In this example, those nucleic acid sequences that share equal lengths may be excluded from both being individual identifiers as it may not be possible to construct one without also constructing the other, even if PCR and DNA size selection are implemented.

FIGS. 12A-12D schematically illustrate an example method, referred to as the “MchooseK scheme”, for constructing identifiers (e.g., nucleic acid molecules) with any number, K, of assembled components (e.g., nucleic acid sequences) out of a larger number, M, of possible components. FIG. 12A illustrates the architecture of identifiers constructed using the MchooseK scheme. Using this method, identifiers are constructed by assembling one component from each layer in any subset of all layers (e.g., choose components from K layers out of M possible layers). FIG. 12B illustrates an example of the combinatorial space of identifiers that may be constructed using the MchooseK scheme. In this assembly scheme the combinatorial space may comprise N^K × (M choose K) possible identifiers for M layers, N components per layer, and an identifier length of K components. In an example, if there are five layers each comprising one component, then up to ten distinct identifiers may be assembled comprising two components each.
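As a sketch of the MchooseK combinatorial space, the identifiers can be enumerated by first choosing which K layers participate and then choosing one component from each. The function name and component labels below are hypothetical; layer order is kept fixed, on the assumption that components are assembled in rank order.

```python
from itertools import combinations, product
from math import comb


def mchoosek_scheme(layers, k):
    """Yield identifiers built from one component in each of k chosen layers."""
    for chosen_layers in combinations(layers, k):  # preserves layer order
        yield from product(*chosen_layers)


# Five layers with one component each, two components per identifier.
five_layers = [["A"], ["B"], ["C"], ["D"], ["E"]]
identifiers = list(mchoosek_scheme(five_layers, 2))
```

This reproduces the ten identifiers of the five-layer example, and in general the count equals N^K multiplied by `comb(M, K)`.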

The MchooseK scheme may be implemented using template directed ligation (see Chemical Methods Section B), as shown in FIG. 12C. As with the TDL implementation for the permutation scheme (FIG. 11C), components in this example are assembled between edge scaffolds that may or may not be included in the reaction master mix. Components may be divided into M layers, for example M=4 layers with predefined rank from 2 to M, where the left edge scaffold may be rank 1 and the right edge scaffold may be rank M+1. Templates comprise nucleic acid sequences for the 3′ to 5′ ligation of any two components with lower rank to higher rank, respectively. There are ((M+1)²+M+1)/2 such templates. An individual identifier of any K components from distinct layers may be constructed by combining those selected components in a ligation reaction with the corresponding K+1 staples used to bring the K components together with the edge scaffolds in their rank order. Such a reaction setup may yield the nucleic acid sequence corresponding to the target identifier between the edge scaffolds. Alternatively, a reaction mix comprising all templates may be combined with the selected components to assemble the target identifier. This alternative method may generate various nucleic acid sequences with the same edge sequences but distinct lengths (if all component lengths are equal), as illustrated in FIG. 12D. The target identifier (bottom) may be isolated from byproduct nucleic acid sequences by size. See Chemical Methods Section E about nucleic acid size-selection.

FIGS. 13A and 13B schematically illustrate an example method, referred to as the “partition scheme”, for constructing identifiers with partitioned components. FIG. 13A shows an example of the combinatorial space of identifiers that may be constructed using the partition scheme. An individual identifier may be constructed by assembling one component from each layer in a fixed order with the optional placement of any partition (specially classified component) between any two components of different layers. For example, a set of components may be organized into one partition component and four layers containing one component each. A component from each layer may be combined in a fixed order and a single partition component may be assembled in various locations between layers. An identifier in this combinatorial space may comprise no partition components, a partition component between the components from the first and second layer, a partition between the components from the second and third layer, and so on to make a combinatorial space of eight possible identifiers. In general, with M layers, each with N components, and p partition components, there are N^M × (p+1)^(M−1) possible identifiers that may be constructed. This method may generate identifiers of various lengths.
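The partition-scheme space can be enumerated as a sketch under one stated assumption: every identifier takes exactly one component from each of the M layers, so the component choices contribute a factor of N^M, and each of the M−1 gaps between adjacent layers independently holds either nothing or one of the p partition components. All names below are hypothetical.

```python
from itertools import product


def partition_scheme(layers, partitions):
    """Yield identifiers with an optional partition in each inter-layer gap."""
    gap_options = [None] + list(partitions)  # None means "no partition here"
    for components in product(*layers):
        for gaps in product(gap_options, repeat=len(layers) - 1):
            identifier = [components[0]]
            for gap, component in zip(gaps, components[1:]):
                if gap is not None:
                    identifier.append(gap)
                identifier.append(component)
            yield tuple(identifier)


# Four single-component layers and one partition component "P".
identifiers = list(partition_scheme([["A"], ["B"], ["C"], ["D"]], ["P"]))
```

The four-layer, one-partition example yields eight identifiers whose lengths range from four to seven components, which illustrates why this scheme produces identifiers of various lengths.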

FIG. 13B shows an example implementation of the partition scheme using template directed ligation (see Chemical Methods Section B). Templates comprise nucleic acid sequences for ligating together one component from each of M layers in a fixed order. For each partition component, additional pairs of templates exist that enable the partition component to ligate in between the components from any two adjacent layers. For example, in one such pair, the first template (with sequence g*b* (5′ to 3′), for example) enables the 3′ end of layer 1 (with sequence b) to ligate to the 5′ end of the partition component (with sequence g), and the second template (with sequence c*h* (5′ to 3′), for example) enables the 3′ end of the partition component (with sequence h) to ligate to the 5′ end of layer 2 (with sequence c). To insert a partition between any two components of adjacent layers, the standard template for ligating together those layers may be excluded from the reaction and the pair of templates for ligating the partition in that position may be selected for the reaction. In the current example, targeting the partition component between layer 1 and layer 2 may use the pair of templates c*h* (5′ to 3′) and g*b* (5′ to 3′) in the reaction rather than the template c*b* (5′ to 3′). Components may be assembled between edge scaffolds that may be included in the reaction mix (along with their corresponding templates for ligating to the first and Mth layers, respectively). In general, a total of around M−1+2*p*(M−1) selectable templates may be used for this method for M layers and p partition components. This implementation of the partition scheme may generate various nucleic acid sequences in a reaction with the same edge sequences but distinct lengths. The target identifier may be isolated from byproduct nucleic acid sequences by DNA size selection. Specifically, there may be exactly one nucleic acid sequence product with exactly M layer components. 
If the layer components are designed large enough compared to the partition components, it may be possible to define a universal size selection region whereby the identifier (and none of the non-targeted byproducts) may be selected regardless of the particular partitioning of the components within the identifier, thereby allowing for multiple partitioned identifiers from multiple reactions to be isolated in the same size selection step. See Chemical Methods Section E about nucleic acid size-selection.
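The template bookkeeping for the partition scheme can be sketched as follows. Layers are represented by their hypothetical integer positions 1..M, a template is represented as an (upstream, downstream) pair, and the edge scaffold templates are left out of the count, as in the M−1+2*p*(M−1) figure.

```python
def partition_templates(num_layers, partitions):
    """Return (standard templates, partition template pairs) for M layers."""
    junctions = range(1, num_layers)  # junction i joins layer i to layer i+1
    standard = [(i, i + 1) for i in junctions]
    pairs = [((i, p), (p, i + 1)) for p in partitions for i in junctions]
    return standard, pairs


def templates_for(num_layers, placement):
    """Select templates for one identifier.

    `placement` maps a junction index to the partition inserted there;
    junctions that are absent from the mapping use the standard template.
    """
    selected = []
    for i in range(1, num_layers):
        if i in placement:
            p = placement[i]
            selected += [(i, p), (p, i + 1)]  # partition pair replaces standard
        else:
            selected.append((i, i + 1))
    return selected


standard, pairs = partition_templates(4, ["P"])
total_templates = len(standard) + 2 * len(pairs)
selection = templates_for(4, {1: "P"})
```

For M=4 and p=1 the total is nine selectable templates, and placing the partition between layers 1 and 2 swaps the single 1-to-2 template for the corresponding pair.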

FIGS. 14A and 14B schematically illustrate an example method, referred to as the “unconstrained string scheme” or “USS”, for constructing identifiers made up of any string of components from a number of possible components. FIG. 14A shows an example of the combinatorial space of 3-component (or 4-scaffold) length identifiers that may be constructed using the unconstrained string scheme. The unconstrained string scheme constructs an individual identifier of length K components with one or more distinct components each taken from one or more layers, where each distinct component can appear at any of the K component positions in the identifier (allowing for repeats). For example, for two layers, each comprising one component, there are eight possible 3-component length identifiers. In general, with M layers, each with one component, there are M^K possible identifiers of length K components. FIG. 14B shows an example implementation of the unconstrained string scheme using template directed ligation (see Chemical Methods Section B). In this method, K+1 single-stranded and ordered scaffold DNA components (including two edge scaffolds and K−1 internal scaffolds) are present in the reaction mix. An individual identifier comprises a single component ligated between every pair of adjacent scaffolds. For example, a component ligated between scaffolds A and B, a component ligated between scaffolds C and D, and so on until all K adjacent scaffold junctions are occupied by a component. In a reaction, selected components from different layers are introduced to scaffolds along with selected pairs of staples that direct them to assemble onto the appropriate scaffolds. For example, the pair of staples a*L* (5′ to 3′) and A*b* (5′ to 3′) direct the layer 1 component with a 5′ end region ‘a’ and 3′ end region ‘b’ to ligate in between the L and A scaffolds. 
In general, with M layers and K+1 scaffolds, 2*M*K selectable staples may be used to construct any USS identifier of length K. Because the staples that connect a component to a scaffold on the 5′ end are disjoint from the staples that connect the same component to a scaffold on the 3′ end, nucleic acid byproducts may form in the reaction with the same edge scaffolds as the target identifier, but with fewer than K components (fewer than K+1 scaffolds) or with more than K components (more than K+1 scaffolds). The targeted identifier may form with exactly K components (K+1 scaffolds) and may therefore be selectable through techniques like DNA size selection if all components are designed to be equal in length and all scaffolds are designed to be equal in length. See Chemical Methods Section E on nucleic acid size selection. In certain embodiments of the unconstrained string scheme where there may be one component per layer, that component may solely comprise a single distinct nucleic acid sequence that fulfills all three roles of (1) an identification barcode, (2) a hybridization region for staple-mediated ligation of the 5′ end to a scaffold, and (3) a hybridization region for staple-mediated ligation of the 3′ end to a scaffold.
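The two counts given for the unconstrained string scheme, M^K identifiers and 2*M*K selectable staples, can be checked with a short enumeration. This is a sketch with hypothetical labels: components are named by layer, scaffolds are an ordered list of K+1 names, and each staple is written as an (upstream, downstream) pair.

```python
from itertools import product


def uss_identifiers(components, k):
    """Every string of k components, repeats allowed."""
    return list(product(components, repeat=k))


def uss_staples(components, scaffolds):
    """Two staples per component per adjacent scaffold pair."""
    staples = []
    for component in components:
        for upstream, downstream in zip(scaffolds, scaffolds[1:]):
            staples.append((upstream, component))    # scaffold to component 5' end
            staples.append((component, downstream))  # component 3' end to scaffold
    return staples


components = ["layer1", "layer2"]  # M = 2 layers, one component each
scaffolds = ["L", "A", "B", "R"]   # K + 1 = 4 scaffolds, so K = 3
identifiers = uss_identifiers(components, 3)
staples = uss_staples(components, scaffolds)
```

With two layers and four scaffolds this gives the eight 3-component identifiers of FIG. 14A and twelve selectable staples.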

The internal scaffolds illustrated in FIG. 14B may be designed such that they use the same hybridization sequence for both the staple-mediated 5′ ligation of the scaffold to a component and the staple-mediated 3′ ligation of the scaffold to another (not necessarily distinct) component. Thus the depicted one-scaffold, two-staple stacked hybridization events in FIG. 14B represent the statistical back-and-forth hybridization events that occur between the scaffold and each of the staples, thus enabling both 5′ component ligation and 3′ component ligation. In other embodiments of the unconstrained string scheme, the scaffold may be designed with two concatenated hybridization regions: a distinct 3′ hybridization region for staple-mediated 3′ ligation and a distinct 5′ hybridization region for staple-mediated 5′ ligation.

FIGS. 15A and 15B schematically illustrate an example method, referred to as the “component deletion scheme”, for constructing identifiers by deleting nucleic acid sequences (or components) from a parent identifier. FIG. 15A shows an example of the combinatorial spaces of possible identifiers that may be constructed using the component deletion scheme. In this example, a parent identifier may comprise multiple components. A parent identifier may comprise more than or equal to about 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50 or more components. An individual identifier may be constructed by selectively deleting any number of components from N possible components, leading to a “full” combinatorial space of size 2^N, or by deleting a fixed number of K components from N possible components, thus leading to an “NchooseK” combinatorial space of size NchooseK. In an example with a parent identifier with 3 components, the full combinatorial space may be 8 and the 3choose2 combinatorial space may be 3.
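Both combinatorial spaces of the component deletion scheme can be enumerated directly. The sketch below uses hypothetical component labels and models deletion purely as removing positions from the parent; it says nothing about the chemistry used to perform the deletion.

```python
from itertools import combinations


def deletion_space(parent, k=None):
    """Identifiers obtained by deleting components from `parent`.

    With k=None any number of components may be deleted (the "full" space);
    otherwise exactly k components are deleted (the "NchooseK" space).
    """
    n = len(parent)
    deletion_counts = range(n + 1) if k is None else [k]
    identifiers = []
    for count in deletion_counts:
        for deleted in combinations(range(n), count):
            kept = tuple(c for i, c in enumerate(parent) if i not in deleted)
            identifiers.append(kept)
    return identifiers


parent = ("A", "B", "C")
full_space = deletion_space(parent)
choose_two = deletion_space(parent, k=2)
```

For the three-component parent this gives a full space of eight identifiers (including the intact parent and the empty product) and a 3choose2 space of three.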

FIG. 15B shows an example implementation of the component deletion scheme using double stranded targeted cleavage and repair (DSTCR). The parent sequence may be a single stranded DNA substrate comprising components flanked by nuclease-specific target sites (which can be 4 or fewer bases in length), and the parent may be incubated with one or more double-strand-specific nucleases corresponding to the target sites. An individual component may be targeted for deletion with a complementary single stranded DNA (or cleavage template) that binds the component DNA (and flanking nuclease sites) on the parent, thus forming a stable double stranded sequence on the parent that may be cleaved on both ends by the nucleases. Another single stranded DNA (or repair template) hybridizes to the resulting disjoint ends of the parent (between which the component sequence had been) and brings them together for ligation, either directly or bridged by a replacement sequence, such that the ligated sequences on the parent no longer contain active nuclease-targeted sites. We refer to this method as “Double Stranded Targeted Cleavage and Repair” (DSTCR). Size selection may be used to select for identifiers with a certain number of deleted components. See Chemical Methods Section E about nucleic acid size-selection.

Alternatively, or in addition, the parent identifier may be a double or single stranded nucleic acid substrate comprising components separated by spacer sequences such that no two components are flanked by the same sequence. The parent identifier may be incubated with Cas9 nuclease. An individual component may be targeted for deletion with guide ribonucleic acids (the cleavage templates) that bind to the edges of the component and enable Cas9-mediated cleavage at its flanking sites. A single stranded nucleic acid (the repair template) may hybridize to the resulting disjoint ends of the parent identifier (e.g., between the ends where the component sequence had been), thus bringing them together for ligation. Ligation may be done directly or by bridging the ends with a replacement sequence, such that the ligated sequences on the parent no longer contain spacer sequences that can be targeted by Cas9. We refer to this method as “sequence specific targeted cleavage and repair” or “SSTCR”.

Identifiers may be constructed by inserting components into a parent identifier using a derivative of DSTCR. A parent identifier may be a single stranded nucleic acid substrate comprising nuclease-specific target sites (which can be 4 or fewer bases in length), each embedded within a distinct nucleic acid sequence. The parent identifier may be incubated with one or more double-strand-specific nucleases corresponding to the target sites. An individual target site on the parent identifier may be targeted for component insertion with a complementary single stranded nucleic acid (the cleavage template) that binds the target site and the distinct surrounding nucleic acid sequence on the parent identifier, thus forming a double stranded site. The double-stranded site may be cleaved by a nuclease. Another single stranded nucleic acid (the repair template) may hybridize to the resulting disjoint ends of the parent identifier and bring them together for ligation, bridged by a component sequence, such that the ligated sequences on the parent no longer contain active nuclease-targeted sites. Alternatively, a derivative of SSTCR may be used to insert components into a parent identifier. The parent identifier may be a double or single-stranded nucleic acid and the parent may be incubated with a Cas9 nuclease. A distinct site on the parent identifier may be targeted for cleavage with a guide RNA (the cleavage template). A single stranded nucleic acid (the repair template) may hybridize to the disjoint ends of the parent identifier and bring them together for ligation, bridged by a component sequence, such that the ligated sequences on the parent identifier no longer contain active nuclease-targeted sites. Size selection may be used to select for identifiers with a certain number of component insertions.

FIG. 16 schematically illustrates a parent identifier with recombinase recognition sites. Recognition sites of different patterns can be recognized by different recombinases. All recognition sites for a given set of recombinases are arranged such that the nucleic acids in between them may be excised if the recombinase is applied. The nucleic acid strand shown in FIG. 16 can adopt 2⁵=32 different sequences depending on the subset of recombinases that are applied to it. In some embodiments, as depicted in FIG. 16, unique molecules can be generated using recombinases to excise, shift, invert, and transpose segments of DNA to create different nucleic acid molecules. In general, with N recombinases there can be 2^N possible identifiers built from a parent. In some embodiments, multiple orthogonal pairs of recognition sites from different recombinases may be arranged on a parent identifier in an overlapping fashion such that the application of one recombinase affects the type of recombination event that occurs when a downstream recombinase is applied (see Roquet et al., Synthetic recombinase-based state machines in living cells, Science 353 (6297): aad8559 (2016), which is entirely incorporated herein by reference). Such a system may be capable of constructing a different identifier for every ordering of N recombinases, N!. Recombinases may be of the tyrosine family such as Flp and Cre, or of the large serine recombinase family such as PhiC31, BxbI, TP901, or A118. The use of recombinases from the large serine recombinase family may be advantageous because they facilitate irreversible recombination and therefore may produce identifiers more efficiently than other recombinases.

In some instances, a single nucleic acid sequence can be programmed to become many distinct nucleic acid sequences by applying numerous recombinases in a distinct order. Approximately e·M! distinct nucleic acid sequences may be generated by applying M recombinases in different subsets and orders thereof, when the number of recombinases, M, is less than or equal to 7 for the large serine recombinase family. When the number of recombinases, M, is greater than 7, the number of sequences that can be produced approximates 3.9^M, see e.g., Roquet et al., Synthetic recombinase-based state machines in living cells, Science 353 (6297): aad8559 (2016), which is entirely incorporated herein by reference. Additional methods for producing different DNA sequences from one common sequence can include targeted nucleic acid editing enzymes such as CRISPR-Cas, TALENs, and Zinc Finger Nucleases. Sequences produced by recombinases, targeted editing enzymes or the like can be used in conjunction with any of the previous methods, for example methods disclosed in any of the figures and disclosure in the present application.
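The e·M! approximation coincides with the number of ways to choose a subset of M recombinases and apply it in some order, which the following sketch enumerates. It counts application orders only; whether each order yields a distinct nucleic acid sequence depends on the recognition-site design and is not modeled here. The function name and recombinase labels are hypothetical.

```python
from itertools import permutations
from math import e, factorial, floor


def ordered_subsets(recombinases):
    """Yield every subset of recombinases in every application order."""
    for size in range(len(recombinases) + 1):
        yield from permutations(recombinases, size)


orders_for_three = list(ordered_subsets(["r1", "r2", "r3"]))
orders_for_seven = sum(1 for _ in ordered_subsets(range(7)))
```

For three recombinases there are 16 ordered subsets, and for seven there are 13700, in both cases equal to e·M! rounded down.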

If the bit-stream of information to be encoded is larger than that which can be encoded by any single nucleic acid molecule, then the information can be split and indexed with nucleic acid sequence barcodes. Moreover, any subset of size k nucleic acid molecules from the set of N nucleic acid molecules can be chosen to produce log₂(Nchoosek) bits of information. Barcodes may be assembled onto the nucleic acid molecules within the subsets of size k to encode even longer bit streams. For example, M barcodes may be used to produce M*log₂(Nchoosek) bits of information. Given a number, N, of available nucleic acid molecules in a set and a number, M, of available barcodes, subsets of size k=k₀ may be chosen to minimize the total number of molecules in a pool to encode a piece of information. A method for encoding digital information can comprise steps for breaking up the bit stream and encoding the individual elements. For example, a bit stream comprising 6 bits can be split into 3 components, each component comprising two bits. Each two bit component can be barcoded to form an information cassette, and grouped or pooled together to form a hyper-pool of information cassettes.
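The capacity figures above can be computed as follows. The search for k₀ is one possible reading of the text and is an assumption: it supposes that each barcode carries one size-k subset, that the number of barcodes needed is the payload divided by the bits per subset (rounded up), and that the total molecule count is the number of barcodes used multiplied by k. The function names are hypothetical.

```python
from math import ceil, comb, log2


def bits_per_subset(n, k):
    """Bits conveyed by choosing k molecules out of n: log2(n choose k)."""
    return log2(comb(n, k))


def min_molecule_k(n, m, total_bits):
    """Return (k0, molecules) minimizing molecules in the pool, or None.

    Assumes one size-k subset per barcode and at most m barcodes.
    """
    best = None
    for k in range(1, n):  # k = n would convey zero bits
        barcodes_needed = ceil(total_bits / bits_per_subset(n, k))
        if barcodes_needed > m:
            continue  # not enough barcodes for this subset size
        molecules = barcodes_needed * k
        if best is None or molecules < best[1]:
            best = (k, molecules)
    return best


capacity = 4 * bits_per_subset(8, 2)  # M = 4 barcodes, N = 8, k = 2
choice = min_molecule_k(8, 4, 10)     # encode 10 bits with N = 8, M = 4
```

Under these assumptions, encoding 10 bits with eight molecules and four barcodes is cheapest at k₀=1, using four molecules in total; a smaller barcode budget pushes the optimum toward larger subsets.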

Barcodes can facilitate information indexing when the amount of digital information to be encoded exceeds the amount that can fit in one pool alone. Information comprising longer strings of bits and/or multiple bytes can be encoded by layering the approach disclosed in FIG. 3, for example, by including a tag with unique nucleic acid sequences encoded using the nucleic acid index. Information cassettes or identifier libraries can comprise nitrogenous bases or nucleic acid sequences that include unique nucleic acid sequences that provide location and bit-value information in addition to a barcode or tag which indicates the component or components of the bit stream that a given sequence corresponds to. Information cassettes can comprise one or more unique nucleic acid sequences as well as a barcode or tag. The barcode or tag on the information cassette can provide a reference for the information cassette and any sequences included in the information cassette. For example, the tag or barcode on an information cassette can indicate which portion of the bit stream or bit component of the bit stream the unique sequence encodes information for (e.g., the bit value and bit position information for).

Using barcodes, more information in bits can be encoded in a pool than the size of the combinatorial space of possible identifiers. A sequence of 10 bits, for example, can be separated into two sets of bytes, each byte comprising 5 bits. Each byte can be mapped to a set of 5 possible distinct identifiers. Initially, the identifiers generated for each byte can be the same, but they may be kept in separate pools or else someone reading the information may not be able to tell which byte a particular nucleic acid sequence belongs to. However, each identifier can be barcoded or tagged with a label that corresponds to the byte for which the encoded information applies (e.g., barcode one may be attached to sequences in the nucleic acid pool to provide the first five bits and barcode two may be attached to sequences in the nucleic acid pool to provide the second five bits), and then the identifiers corresponding to the two bytes can be combined into one pool (e.g., “hyper-pool” or one or more identifier libraries). Each identifier library of the one or more combined identifier libraries may comprise a distinct barcode that identifies a given identifier as belonging to a given identifier library. Methods for adding a barcode to each identifier in an identifier library can comprise using PCR, Gibson, ligation, or any other approach that enables a given barcode (e.g., barcode 1) to attach to a given nucleic acid sample pool (e.g., barcode 1 to nucleic acid sample pool 1 and barcode 2 to nucleic acid sample pool 2). The sample from the hyper-pool can be read with sequencing methods, and sequencing information can be parsed using the barcode or tag. A method using identifier libraries and barcodes with a set of M barcodes and N possible identifiers (the combinatorial space) can encode a stream of bits with a length equivalent to the product of M and N.
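The barcoded hyper-pool described above can be modeled abstractly as a set of (barcode, identifier) pairs, where a pair is present when the corresponding bit is 1 and absent when it is 0. The sketch below is purely illustrative: barcodes and identifiers are represented by hypothetical integer indices rather than nucleic acid sequences.

```python
def encode(bits, n):
    """Map a bit string to a hyper-pool of (barcode, identifier) pairs.

    Bits are grouped n at a time; the group index plays the role of the
    barcode and the offset within the group selects the identifier.
    """
    pool = set()
    for position, bit in enumerate(bits):
        if bit == "1":
            pool.add(divmod(position, n))  # (barcode index, identifier index)
    return pool


def decode(pool, m, n):
    """Recover the m*n-bit string from the presence or absence of each pair."""
    return "".join(
        "1" if (barcode, identifier) in pool else "0"
        for barcode in range(m)
        for identifier in range(n)
    )


message = "1011000111"      # 10 bits split into two groups of 5
pool = encode(message, 5)
recovered = decode(pool, 2, 5)
```

The same five identifier indices are reused under both barcodes, yet the pooled pairs remain unambiguous, so M barcodes and N identifiers together address M×N bit positions.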

In some embodiments, identifier libraries may be stored in an array of wells. The array of wells may be defined as having n columns and q rows and each well may comprise two or more identifier libraries in a hyper-pool. The information encoded across the wells may constitute one large contiguous item of information of size n×q times larger than the information contained in each of the wells. An aliquot may be taken from one or more of the wells in the array of wells and the encoding may be read using sequencing, hybridization, or PCR.

A nucleic acid sample pool, hyper-pool, identifier library, group of identifier libraries, or a well containing a nucleic acid sample pool or hyper-pool may comprise unique nucleic acid molecules (e.g., identifiers) corresponding to bits of information and a plurality of supplemental nucleic acid sequences. The supplemental nucleic acid sequences may not correspond to encoded data (e.g., do not correspond to a bit value). The supplemental nucleic acid sequences may mask or encrypt the information stored in the sample pool. The supplemental nucleic acid sequences may be derived from a biological source or synthetically produced. Supplemental nucleic acid sequences derived from a biological source may include randomly fragmented nucleic acid sequences or rationally fragmented sequences. The biologically derived supplemental nucleic acids may hide or obscure the data-containing nucleic acids within the sample pool by providing natural genetic information along with the synthetically encoded information, especially if the synthetically encoded information (e.g., the combinatorial space of identifiers) is made to resemble natural genetic information (e.g., a fragmented genome). In an example, the identifiers are derived from a biological source and the supplemental nucleic acids are derived from a biological source. A sample pool may contain multiple sets of identifiers and supplemental nucleic acid sequences. Each set of identifiers and supplemental nucleic acid sequences may be derived from different organisms. In an example, the identifiers are derived from one or more organisms and the supplemental nucleic acid sequences are derived from a single, different organism. The supplemental nucleic acid sequences may also be derived from one or more organisms and the identifiers may be derived from a single organism that is different from the organism that the supplemental nucleic acids are derived from. 
Both the identifiers and the supplemental nucleic acid sequences may be derived from multiple different organisms. A key may be used to distinguish the identifiers from the supplemental nucleic acid sequences.

The supplemental nucleic acid sequences may store metadata about the written information. The metadata may comprise extra information for determining and/or authorizing the source of the original information and/or the intended recipient of the original information. The metadata may comprise extra information about the format of the original information, the instruments and methods used to encode and write the original information, and the date and time of writing the original information into the identifiers. The metadata may comprise additional information about the format of the original information, the instruments and methods used to encode and write the original information, and the date and time of writing the original information into nucleic acid sequences. The metadata may comprise additional information about modifications made to the original information after writing the information into nucleic acid sequences. The metadata may comprise annotations to the original information or one or more references to external information. Alternatively, or in addition, the metadata may be stored in one or more barcodes or tags attached to the identifiers.

The identifiers in an identifier pool may have the same, similar, or different lengths than one another. The supplemental nucleic acid sequences may have a length that is less than, substantially equal to, or greater than the length of the identifiers. The supplemental nucleic acid sequences may have an average length that is within one base, within two bases, within three bases, within four bases, within five bases, within six bases, within seven bases, within eight bases, within nine bases, within ten bases, or within more bases of the average length of the identifiers. In an example, the supplemental nucleic acid sequences are the same or substantially the same length as the identifiers. The concentration of supplemental nucleic acid sequences may be less than, substantially equal to, or greater than the concentration of the identifiers in the identifier library. The concentration of the supplemental nucleic acids may be less than or equal to about 1%, 10%, 20%, 40%, 60%, 80%, 100%, 125%, 150%, 175%, 200%, 1000%, 1×10⁴%, 1×10⁵%, 1×10⁶%, 1×10⁷%, 1×10⁸% or less than the concentration of the identifiers. The concentration of the supplemental nucleic acids may be greater than or equal to about 1%, 10%, 20%, 40%, 60%, 80%, 100%, 125%, 150%, 175%, 200%, 1000%, 1×10⁴%, 1×10⁵%, 1×10⁶%, 1×10⁷%, 1×10⁸% or more than the concentration of the identifiers. Larger concentrations may be beneficial for obfuscation or concealing data. In an example, the concentration of the supplemental nucleic acid sequences is substantially greater (e.g., 1×10⁸% greater) than the concentration of identifiers in an identifier pool.

Methods for Copying and Accessing Data Stored in Nucleic Acid Sequences

In another aspect, the present disclosure provides methods for copying (or replicating) information encoded in nucleic acid sequence(s). A method for copying information encoded in nucleic acid sequence(s) may comprise (a) providing an identifier library and (b) constructing one or more copies of the identifier library. An identifier library may comprise a subset of a plurality of identifiers from a larger combinatorial space. Each individual identifier of the plurality of identifiers may correspond to an individual symbol in a string of symbols. An identifier may comprise one or more components. A component may comprise a nucleic acid sequence.

In another aspect, the present disclosure provides methods for accessing information encoded in nucleic acid sequences. A method for accessing information encoded in nucleic acid sequences may comprise (a) providing an identifier library, and (b) extracting a portion or a subset of the identifiers present in the identifier library from the identifier library. An identifier library may comprise a subset of a plurality of identifiers from a larger combinatorial space. Each individual identifier of the plurality of identifiers may correspond to an individual symbol in a string of symbols. An identifier may comprise one or more components. A component may comprise a nucleic acid sequence.

Information may be written into one or more identifier libraries as described elsewhere herein. Identifiers may be constructed using any method described elsewhere herein. Stored data may be copied by generating copies of the individual identifiers in an identifier library or in one or more identifier libraries. A portion of the identifiers may be copied or an entire library may be copied. Copying may be performed by amplifying the identifiers in an identifier library. When one or more identifier libraries are combined, a single identifier library or multiple identifier libraries may be copied. If an identifier library comprises supplemental nucleic acid sequences, the supplemental nucleic acid sequences may or may not be copied.

Identifiers in an identifier library may be constructed to comprise one or more common primer binding sites. The one or more binding sites may be located at the edges of each identifier or interspersed throughout each identifier. The primer binding site may allow for an identifier library specific primer pair or a universal primer pair to bind to and amplify the identifiers. All the identifiers within an identifier library or all the identifiers in one or more identifier libraries may be replicated multiple times by multiple PCR cycles. Conventional PCR may be used to copy the identifiers, in which case the number of copies of an identifier may increase exponentially with each PCR cycle. Linear PCR may be used to copy the identifiers, in which case the number of identifier copies may increase linearly with each PCR cycle. The identifiers may be ligated into a circular vector prior to PCR amplification. The circular vector may comprise a barcode at each end of the identifier insertion site. The PCR primers for amplifying identifiers may be designed to prime to the vector such that the barcoded edges are included with the identifier in the amplification product. During amplification, recombination between identifiers may result in copied identifiers that comprise non-correlated barcodes on each edge. The non-correlated barcodes may be detectable upon reading the identifiers. Identifiers containing non-correlated barcodes may be considered false positives and may be disregarded during the information decoding process. See Chemical Methods Section D.

Information may be encoded by assigning each bit of information to a unique nucleic acid molecule. For example, three sample sets (X, Y, and Z) each containing two nucleic acid sequences may assemble into eight unique nucleic acid molecules and encode eight bits of data:

-   N1=X1Y1Z1
-   N2=X1Y1Z2
-   N3=X1Y2Z1
-   N4=X1Y2Z2
-   N5=X2Y1Z1
-   N6=X2Y1Z2
-   N7=X2Y2Z1
-   N8=X2Y2Z2

Each bit in a string may then be assigned to the corresponding nucleic acid molecule (e.g., N1 may specify the first bit, N2 may specify the second bit, N3 may specify the third bit, and so forth). The entire bit string may be assigned to a combination of nucleic acid molecules where the nucleic acid molecules corresponding to bit-values of ‘1’ are included in the combination or pool. For example, in UTF-8 encoding, the letter ‘K’ may be represented by the 8-bit string code 01001011, which may be encoded by the presence of four nucleic acid molecules (e.g., X1Y1Z2, X2Y1Z1, X2Y2Z1, and X2Y2Z2 in the above example).
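The mapping described above can be illustrated with a minimal sketch. The component labels follow the example; the function names `build_molecules` and `encode` are hypothetical and introduced only for illustration, and the sketch models molecules as text labels rather than any actual chemistry.

```python
from itertools import product

# Component labels taken from the example above (two sequences per set).
SETS = [["X1", "X2"], ["Y1", "Y2"], ["Z1", "Z2"]]

def build_molecules(sets):
    """List every assembled molecule in order, so index i holds bit i+1 (N1, N2, ...)."""
    return ["".join(parts) for parts in product(*sets)]

def encode(bits, molecules):
    """Return the pool: the molecules whose bit position holds a '1'."""
    if len(bits) > len(molecules):
        raise ValueError("bit string is longer than the number of molecules")
    return [m for bit, m in zip(bits, molecules) if bit == "1"]

molecules = build_molecules(SETS)      # N1..N8
bits = format(ord("K"), "08b")         # 8-bit code for the letter 'K'
pool = encode(bits, molecules)
print(bits, pool)
```

Under these assumptions the pool for ‘K’ contains exactly the four molecules named in the example, and molecules for ‘0’ positions are simply never assembled.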

The information may be accessed through sequencing or hybridization assays. For example, primers or probes may be designed to bind to common regions or the barcoded region of the nucleic acid sequence. This may enable amplification of any region of the nucleic acid molecule. The amplification product may then be read by sequencing the amplification product or by a hybridization assay. In the above example encoding the letter ‘K’, if the first half of the data is of interest, a primer specific to the barcode region of the X1 nucleic acid sequence and a primer that binds to the common region of the Z set may be used to amplify the nucleic acid molecules. This may return the sequence Y1Z2, which may encode 0100. A substring of that data may also be accessed by further amplifying the nucleic acid molecules with a primer that binds to the barcode region of the Y1 nucleic acid sequence and a primer that binds to the common sequence of the Z set. This may return the Z2 nucleic acid sequence, encoding the substring 01. Alternatively, the data may be accessed by checking for the presence or absence of a particular nucleic acid sequence without sequencing. For example, amplification with a primer specific to the Y2 barcode may generate amplification products for the Y2 barcode, but not for the Y1 barcode. The presence of Y2 amplification product may signal a bit value of ‘1’. Alternatively, the absence of Y2 amplification products may signal a bit value of ‘0’.
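The prefix-based access in the ‘K’ example can be sketched as follows. This is a simplified model, not a description of the chemistry: primer specificity is approximated by a text prefix match, and the function name `access` is a hypothetical helper.

```python
from itertools import product

def access(pool, prefix, remaining_sets):
    """Keep molecules that start with `prefix`, drop the prefix, and read the
    bits over the combinatorial space of the remaining component sets."""
    trimmed = {m[len(prefix):] for m in pool if m.startswith(prefix)}
    space = ["".join(p) for p in product(*remaining_sets)]
    return "".join("1" if s in trimmed else "0" for s in space)

# Pool encoding the letter 'K' in the example above.
pool = ["X1Y1Z2", "X2Y1Z1", "X2Y2Z1", "X2Y2Z2"]
Y, Z = ["Y1", "Y2"], ["Z1", "Z2"]

first_half = access(pool, "X1", [Y, Z])   # amplify with an X1-specific primer
first_two = access(pool, "X1Y1", [Z])     # then with a Y1-specific primer
print(first_half, first_two)
```

In this model the first access leaves only Y1Z2, which reads as 0100, and the second leaves only Z2, which reads as 01, matching the example.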

PCR based methods can be used to access and copy data from identifier or nucleic acid sample pools. Using common primer binding sites that flank the identifiers in the pools or hyper-pools, nucleic acids containing information can be readily copied. Alternatively, other nucleic acid amplification approaches such as isothermal amplification may also be used to readily copy data from sample pools or hyper-pools (e.g., identifier libraries). See Chemical Methods Section D on nucleic acid amplification. In instances where the sample comprises hyper-pools, a particular subset of information (e.g., all nucleic acids relating to a particular barcode) can be accessed and retrieved by using a primer that binds the specific barcode at one edge of the identifier in the forward orientation, along with another primer that binds a common sequence on the opposite edge of the identifier in a reverse orientation. This process can be repeated multiple times to access sub-pools from sub-pools of identifiers (for example, all nucleic acids with two or more particular barcodes). For example, nested PCR may be used, first with a primer that binds to a particular barcode on one edge, then with a primer that binds to a particular barcode one removed from said edge, then with a primer that binds to a barcode two removed from said edge, and so on. Various read-out methods can be used to pull information from the encoded nucleic acid; for example, microarray (or any sort of fluorescent hybridization), digital PCR, quantitative PCR (qPCR), and various sequencing platforms can be used to read out the encoded sequences and, by extension, the digitally encoded data.

Accessing information stored in nucleic acid molecules (e.g., identifiers) may be performed by selectively removing the portion of non-targeted identifiers from an identifier library or a pool of identifiers or, for example, selectively removing all identifiers of an identifier library from a pool of multiple identifier libraries. Accessing data may also be performed by selectively capturing targeted identifiers from an identifier library or pool of identifiers. The targeted identifiers may correspond to data of interest within the larger item of information. A pool of identifiers may comprise supplemental nucleic acid molecules. The supplemental nucleic acid molecules may contain metadata about the encoded information or may be used to encrypt or mask the identifiers corresponding to the information. The supplemental nucleic acid molecules may or may not be extracted while accessing the targeted identifiers. FIGS. 17A-17C schematically illustrate an overview of example methods for accessing portions of information stored in nucleic acid sequences by accessing a number of particular identifiers from a larger number of identifiers. FIG. 17A shows example methods for using polymerase chain reaction, affinity tagged probes, and degradation targeting probes to access identifiers containing a specified component. For PCR-based access, a pool of identifiers (e.g., identifier library) may comprise identifiers with a common sequence at each end, a variable sequence at each end, or one of a common sequence or a variable sequence at each end. The common sequences or variable sequences may be primer binding sites. One or more primers may bind to the common or variable regions on the identifier edges. The identifiers with primers bound may be amplified by PCR. The amplified identifiers may significantly outnumber the non-amplified identifiers. During reading, the amplified identifiers may be identified. An identifier from an identifier library may comprise sequences on one or both of its ends that are distinct to that library, thus enabling a single library to be selectively accessed from a pool or group of more than one identifier library.

For affinity-tag based access, a process which may be referred to as nucleic acid capture, the components that constitute the identifiers in a pool may share complementarity with one or more probes. The one or more probes may bind or hybridize to the identifiers to be accessed. The probe may comprise an affinity tag. The affinity tags may bind to a bead, generating a complex comprising a bead, at least one probe, and at least one identifier. The beads may be magnetic, and together with a magnet, the beads may collect and isolate the identifiers to be accessed. The identifiers may be removed from the beads under denaturing conditions prior to reading. Alternatively, or in addition, the beads may collect the non-targeted identifiers and sequester them away from the rest of the pool, which can be washed into a separate vessel and read. The affinity tag may bind to a column. The identifiers to be accessed may bind to the column for capture. Column-bound identifiers may subsequently be eluted or denatured from the column prior to reading. Alternatively, the non-targeted identifiers may be selectively targeted to the column while the targeted identifiers may flow through the column. Accessing the targeted identifiers may comprise applying one or more probes to a pool of identifiers simultaneously or applying one or more probes to a pool of identifiers sequentially. See Chemical Methods Section F on nucleic acid capture.

For degradation based access, the components that constitute the identifiers in a pool may share complementarity with one or more degradation-targeting probes. The probes may bind to or hybridize with distinct components on the identifiers. The probe may be a target for a degradation enzyme, such as an endonuclease. In an example, one or more identifier libraries may be combined. A set of probes may hybridize with one of the identifier libraries. The set of probes may comprise RNA and the RNA may guide a Cas9 enzyme. A Cas9 enzyme may be introduced to the one or more identifier libraries. The identifiers hybridized with the probes may be degraded by the Cas9 enzyme. The identifiers to be accessed may not be degraded by the degradation enzyme. In another example, the identifiers may be single-stranded and the identifier library may be combined with one or more single-strand specific endonucleases, such as the S1 nuclease, that selectively degrade identifiers that are not to be accessed. Identifiers to be accessed may be hybridized with a complementary set of identifiers to protect them from degradation by the single-strand specific endonuclease(s). The identifiers to be accessed may be separated from the degradation products by size selection, such as size selection chromatography (e.g., agarose gel electrophoresis). Alternatively, or in addition, identifiers that are not degraded may be selectively amplified (e.g., using PCR) such that the degradation products are not amplified. The non-degraded identifiers may be amplified using primers that hybridize to each end of the non-degraded identifiers and therefore not to each end of the degraded or cleaved identifiers.

FIG. 17B shows example methods for using polymerase chain reaction to perform ‘OR’ or ‘AND’ operations to access identifiers containing multiple components. In an example, if two forward primers bind distinct sets of identifiers on the left end, then an ‘OR’ amplification of the union of those sets of identifiers may be accomplished by using the two forward primers together in a multiplex PCR reaction with a reverse primer that binds all of the identifiers on the right end. In another example, if one forward primer binds a set of identifiers on the left end and one reverse primer binds a set of identifiers on the right end, then an ‘AND’ amplification of the intersection of those two sets of identifiers may be accomplished by using the forward primer and the reverse primer together as a primer pair in a PCR reaction. This process may be repeated in a sequential fashion (e.g., nested PCR) to access identifier sub-pools with any number of components in common.
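The ‘OR’ and ‘AND’ selections can be viewed as set operations on the identifier pool. The following is a minimal sketch under simplifying assumptions: identifiers are modeled as tuples of component labels, a primer is modeled as the label of the end component it binds, and the names `amplify`, `A1`, `B1`, and so on are hypothetical.

```python
def amplify(pool, forward, reverse):
    """Return the identifiers whose left-end component is bound by one of the
    forward primers and whose right-end component is bound by one of the
    reverse primers; only these would be amplified."""
    return {ident for ident in pool
            if ident[0] in forward and ident[-1] in reverse}

pool = {("A1", "B1"), ("A1", "B2"), ("A2", "B1"), ("A3", "B2")}
all_right_ends = {"B1", "B2"}   # stands in for a reverse primer binding every identifier

# 'OR': two forward primers multiplexed with the common reverse primer.
or_result = amplify(pool, {"A1", "A2"}, all_right_ends)

# 'AND': one forward primer paired with one specific reverse primer.
and_result = amplify(pool, {"A1"}, {"B1"})

print(sorted(or_result), sorted(and_result))
```

Here the ‘OR’ result is the union of the A1 and A2 sets, and the ‘AND’ result is the intersection of the A1 set with the B1 set.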

With each iteration of PCR-based access on an identifier library, the identifiers may become shorter as primers are designed to bind components iteratively further inward from each edge. For example, an identifier library may comprise identifiers of the form A-B-C-D-E-F-G, where A, B, C, D, E, F, and G are layers. Upon amplifying with primers that bind particular components, for example, A₁ and G₁ in layers A and G respectively, the amplified portion of the identifier library may take on the form A₁-B-C-D-E-F-G₁. Upon further amplifying with primers that bind particular components, for example, B₁ and F₁ in layers B and F respectively, the amplified portion of the identifier library may take on the form B₁-C-D-E-F₁, where it may be assumed that these shorter amplified sequences correspond to full identifiers that further comprise component A₁ in the position of layer A and G₁ in the position of layer G.

FIG. 17C shows example methods for using affinity tags to perform ‘OR’ or ‘AND’ operations to access identifiers containing multiple components. In an example, if affinity probe ‘P1’ captures all identifiers with component ‘C1’ and another affinity probe ‘P2’ captures all identifiers with component ‘C2’, then the set of all identifiers with C1 or C2 can be captured by using P1 and P2 simultaneously (corresponding to an ‘OR’ operation). In another example with the same components and probes, the set of all identifiers with C1 and C2 can be captured by using P1 and P2 sequentially (corresponding to an ‘AND’ operation).

Methods for Reading Information Stored in Nucleic Acid Sequences

In another aspect, the present disclosure provides methods for reading information encoded in nucleic acid sequences. A method for reading information encoded in nucleic acid sequences may comprise (a) providing an identifier library, (b) identifying the identifiers present in the identifier library, (c) generating a string of symbols from the identifiers present in the identifier library, and (d) compiling information from the string of symbols. An identifier library may comprise a subset of a plurality of identifiers from a combinatorial space. Each individual identifier of the subset of identifiers may correspond to an individual symbol in a string of symbols. An identifier may comprise one or more components. A component may comprise a nucleic acid sequence.

Information may be written into one or more identifier libraries as described elsewhere herein. Identifiers may be constructed using any method described elsewhere herein. Stored data may be copied and accessed using any method described elsewhere herein.

The identifier may comprise information relating to a location of the encoded symbol, a value of the encoded symbol, or both the location and the value of the encoded symbol. An identifier may include information relating to a location of the encoded symbol and the presence or absence of the identifier in an identifier library may indicate the value of the symbol. The presence of an identifier in an identifier library may indicate a first symbol value (e.g., first bit value) in a binary string and the absence of an identifier in an identifier library may indicate a second symbol value (e.g., second bit value) in a binary string. In a binary system, basing a bit value on the presence or absence of an identifier in an identifier library may reduce the number of identifiers assembled and, therefore, reduce the write time. In an example, the presence of an identifier may indicate a bit value of ‘1’ at the mapped location and the absence of an identifier may indicate a bit value of ‘0’ at the mapped location.
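A minimal sketch of presence/absence decoding follows. It assumes the reader already has an ordered list of all possible identifiers (called `rank` here, a hypothetical name) and the set of identifiers detected in the library; both are modeled as plain text labels.

```python
def decode(detected, rank):
    """Rebuild the bit string: '1' where the identifier mapped to that
    position was detected, '0' where it was absent."""
    return "".join("1" if ident in detected else "0" for ident in rank)

# Ordered identifier space for an 8-bit string (labels are illustrative).
rank = ["X1Y1Z1", "X1Y1Z2", "X1Y2Z1", "X1Y2Z2",
        "X2Y1Z1", "X2Y1Z2", "X2Y2Z1", "X2Y2Z2"]
detected = {"X1Y1Z2", "X2Y1Z1", "X2Y2Z1", "X2Y2Z2"}

bits = decode(detected, rank)
symbol = chr(int(bits, 2))
print(bits, symbol)
```

Only the identifiers for ‘1’ positions need to exist in the library; every ‘0’ is inferred from absence, which is why fewer identifiers have to be assembled at write time.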

Generating symbols (e.g., bit values) for a piece of information may include identifying the presence or absence of the identifier that the symbol (e.g., bit) may be mapped or encoded to. Determining the presence or absence of an identifier may include sequencing the present identifiers or using a hybridization array to detect the presence of an identifier. In an example, decoding and reading the encoded sequences may be performed using sequencing platforms. Examples of sequencing platforms are described in U.S. patent application Ser. No. 14/465,685 filed Aug. 21, 2014, U.S. patent application Ser. No. 13/886,234 filed May 2, 2013, and U.S. patent application Ser. No. 12/400,593 filed Mar. 9, 2009, each of which is entirely incorporated herein by reference.

In an example, decoding nucleic acid encoded data may be achieved by base-by-base sequencing of the nucleic acid strands, such as Illumina® Sequencing, or by utilizing a sequencing technique that indicates the presence or absence of specific nucleic acid sequences, such as fragmentation analysis by capillary electrophoresis. The sequencing may employ the use of reversible terminators. The sequencing may employ the use of natural or non-natural (e.g., engineered) nucleotides or nucleotide analogs. Alternatively, or in addition, decoding nucleic acid sequences may be performed using a variety of analytical techniques, including but not limited to, any methods that generate optical, electrochemical, or chemical signals. A variety of sequencing approaches may be used including, but not limited to, polymerase chain reaction (PCR), digital PCR, Sanger sequencing, high-throughput sequencing, sequencing-by-synthesis, single-molecule sequencing, sequencing-by-ligation, RNA-Seq (Illumina), Next generation sequencing, Digital Gene Expression (Helicos), Clonal Single MicroArray (Solexa), shotgun sequencing, Maxam-Gilbert sequencing, or massively-parallel sequencing.

Various read-out methods can be used to pull information from the encoded nucleic acid. In an example, microarray (or any sort of fluorescent hybridization), digital PCR, quantitative PCR (qPCR), and various sequencing platforms can be used to read out the encoded sequences and, by extension, the digitally encoded data.

An identifier library may further comprise supplemental nucleic acid sequences that provide metadata about the information, encrypt or mask the information, or that both provide metadata and mask the information. The supplemental nucleic acids may be identified simultaneously with identification of the identifiers. Alternatively, the supplemental nucleic acids may be identified prior to or after identifying the identifiers. In an example, the supplemental nucleic acids are not identified during reading of the encoded information. The supplemental nucleic acid sequences may be indistinguishable from the identifiers. An identifier index or a key may be used to differentiate the supplemental nucleic acid molecules from the identifiers.

The efficiency of encoding and decoding data may be increased by recoding input bit strings to enable the use of fewer nucleic acid molecules. For example, if an input string is received with a high occurrence of ‘111’ substrings, which may map to three nucleic acid molecules (e.g., identifiers) with an encoding method, it may be recoded to a ‘000’ substring which may map to a null set of nucleic acid molecules. The alternate input substring of ‘000’ may also be recoded to ‘111’. This method of recoding may reduce the total amount of nucleic acid molecules used to encode the data because there may be a reduction in the number of ‘1’s in the dataset. In this example, the total size of the dataset may be increased to accommodate a codebook that specifies the new mapping instructions. An alternative method for increasing encoding and decoding efficiency may be to recode the input string with a variable-length mapping that reduces its length. For example, ‘111’ may be recoded to ‘00’ which may shrink the size of the dataset and reduce the number of ‘1’s in the dataset.
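The ‘111’/‘000’ swap can be sketched as below. This is one possible reading of the recoding described above, with assumptions stated plainly: the input is split into fixed, aligned 3-bit blocks, the swap is applied only when it lowers the count of ‘1’s, and the codebook is reduced to a single flag. The names `recode` and `restore` are hypothetical.

```python
def _swap_blocks(bits, block):
    """Exchange all-ones and all-zeros blocks, leaving other blocks unchanged."""
    ones, zeros = "1" * block, "0" * block
    mapping = {ones: zeros, zeros: ones}
    chunks = [bits[i:i + block] for i in range(0, len(bits), block)]
    return "".join(mapping.get(c, c) for c in chunks)

def recode(bits, block=3):
    """Return (recoded_bits, swapped). `swapped` acts as a one-flag codebook."""
    if len(bits) % block:
        raise ValueError("length must be a multiple of the block size")
    candidate = _swap_blocks(bits, block)
    swapped = candidate.count("1") < bits.count("1")
    return (candidate if swapped else bits), swapped

def restore(bits, swapped, block=3):
    """Invert recode() using the codebook flag."""
    return _swap_blocks(bits, block) if swapped else bits

original = "111111000111"
recoded, swapped = recode(original)
print(original.count("1"), recoded.count("1"), swapped)
```

In this sketch the number of ‘1’s, and therefore the number of molecules to assemble, drops from nine to three, at the cost of storing the flag alongside the data.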

The speed and efficiency of decoding nucleic acid encoded data may be controlled (e.g., increased) by specifically designing identifiers for ease of detection. For example, nucleic acid sequences (e.g., identifiers) that are designed for ease of detection may include nucleic acid sequences comprising a majority of nucleotides that are easier to call and detect based on their optical, electrochemical, chemical, or physical properties. Engineered nucleic acid sequences may be either single or double stranded. Engineered nucleic acid sequences may include synthetic or unnatural nucleotides that improve the detectable properties of the nucleic acid sequence. Engineered nucleic acid sequences may comprise all natural nucleotides, all synthetic or unnatural nucleotides, or a combination of natural, synthetic, and unnatural nucleotides. Synthetic nucleotides may include nucleotide analogues such as peptide nucleic acids, locked nucleic acids, glycol nucleic acids, and threose nucleic acids. Unnatural nucleotides may include dNaM, an artificial nucleoside containing a 3-methoxy-2-naphthyl group, and d5SICS, an artificial nucleoside containing a 6-methylisoquinoline-1-thione-2-yl group. Engineered nucleic acid sequences may be designed for a single enhanced property, such as enhanced optical properties, or the designed nucleic acid sequences may be designed with multiple enhanced properties, such as enhanced optical and electrochemical properties or enhanced optical and chemical properties. See Chemical Methods Section H on DNA design.

Engineered nucleic acid sequences may comprise reactive natural, synthetic, and unnatural nucleotides that do not improve the optical, electrochemical, chemical, or physical properties of the nucleic acid sequences. The reactive components of the nucleic acid sequences may enable the addition of a chemical moiety that confers improved properties to the nucleic acid sequence. Each nucleic acid sequence may include a single chemical moiety or may include multiple chemical moieties. Example chemical moieties may include, but are not limited to, fluorescent moieties, chemiluminescent moieties, acidic or basic moieties, hydrophobic or hydrophilic moieties, and moieties that alter oxidation state or reactivity of the nucleic acid sequence.

A sequencing platform may be designed specifically for decoding and reading information encoded into nucleic acid sequences. The sequencing platform may be dedicated to sequencing single or double stranded nucleic acid molecules. The sequencing platform may decode nucleic acid encoded data by reading individual bases (e.g., base-by-base sequencing) or by detecting the presence or absence of an entire nucleic acid sequence (e.g., component) incorporated within the nucleic acid molecule (e.g., identifier). The sequencing platform may include the use of promiscuous reagents, increased read lengths, and the detection of specific nucleic acid sequences by the addition of detectable chemical moieties. The use of more promiscuous reagents during sequencing may increase reading efficiency by enabling faster base calling which in turn may decrease the sequencing time. The use of increased read lengths may enable longer sequences of encoded nucleic acids to be decoded per read. The addition of detectable chemical moiety tags may enable the detection of the presence or absence of a nucleic acid sequence by the presence or absence of a chemical moiety. For example, each nucleic acid sequence encoding a bit of information may be tagged with a chemical moiety that generates a unique optical, electrochemical, or chemical signal. The presence or absence of that unique optical, electrochemical, or chemical signal may indicate a ‘0’ or a ‘1’ bit value. The nucleic acid sequence may comprise a single chemical moiety or multiple chemical moieties. The chemical moiety may be added to the nucleic acid sequence prior to use of the nucleic acid sequence to encode data. Alternatively, or in addition, the chemical moiety may be added to the nucleic acid sequence after encoding the data, but prior to decoding the data. The chemical moiety tag may be added directly to the nucleic acid sequence or the nucleic acid sequence may comprise a synthetic or unnatural nucleotide anchor and the chemical moiety tag may be added to that anchor.

Unique codes may be applied to minimize or detect encoding and decoding errors. Encoding and decoding errors may occur from false negatives (e.g., a nucleic acid molecule or identifier not included in a random sampling). An example of an error detecting code may be a checksum sequence that counts the number of identifiers in a contiguous set of possible identifiers that is included in the identifier library. While reading the identifier library, the checksum may indicate how many identifiers from that contiguous set of identifiers to expect to retrieve, and identifiers can continue to be sampled for reading until the expected number is met. In some embodiments, a checksum sequence may be included for every contiguous set of R identifiers where R can be greater than or equal to 1, 2, 5, 10, 50, 100, 200, 500, or 1000, or less than 1000, 500, 200, 100, 50, 10, 5, or 2. The smaller the value of R, the better the error detection. In some embodiments, the checksums may be supplemental nucleic acid sequences. For example, a set comprising seven nucleic acid sequences (e.g., components) per layer may be divided into two groups, nucleic acid sequences for constructing identifiers with a product scheme (components X1-X3 in layer X and Y1-Y3 in layer Y), and nucleic acid sequences for the supplemental checksums (X4-X7 and Y4-Y7). The checksum sequences X4-X7 may indicate whether zero, one, two, or three sequences of layer X are assembled with each member of layer Y. Alternatively, the checksum sequences Y4-Y7 may indicate whether zero, one, two, or three sequences of layer Y are assembled with each member of layer X. In this example, an original identifier library with identifiers {X1Y1, X1Y3, X2Y1, X2Y2, X2Y3} may be supplemented to include checksums to become the following pool: {X1Y1, X1Y3, X2Y1, X2Y2, X2Y3, X1Y6, X2Y7, X3Y4, X6Y1, X5Y2, X6Y3}. The checksum sequences may also be used for error correction. For example, absence of X1Y1 from the above dataset and the presence of X1Y6 and X6Y1 may enable inference that the X1Y1 nucleic acid molecule is missing from the dataset. The checksum sequences may indicate whether identifiers are missing from a sampling of the identifier library or an accessed portion of the identifier library. In the case of a missing checksum sequence, access methods such as PCR or affinity tagged probe hybridization may amplify and/or isolate it. In some embodiments, the checksums may not be supplemental nucleic acid sequences. The checksums may be coded directly into the information such that they are represented by identifiers.
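The two-layer checksum example above can be reproduced with a short sketch. It assumes three data components per layer (indices 1-3) and that checksum component index 4 + k in a layer signals a count of k, which is consistent with the pool listed in the example; the function names are hypothetical.

```python
def add_checksums(library, n=3):
    """library: set of (x, y) index pairs with 1 <= x, y <= n.
    Returns identifier labels plus one checksum label per X and per Y member."""
    pool = {f"X{x}Y{y}" for x, y in library}
    for x in range(1, n + 1):                      # Y4..Y7 count partners of each X
        count = sum(1 for a, _ in library if a == x)
        pool.add(f"X{x}Y{n + 1 + count}")
    for y in range(1, n + 1):                      # X4..X7 count partners of each Y
        count = sum(1 for _, b in library if b == y)
        pool.add(f"X{n + 1 + count}Y{y}")
    return pool

def parse(label):
    """Split a label such as 'X1Y6' into integer indices (1, 6)."""
    x, y = label[1:].split("Y")
    return int(x), int(y)

def find_shortfalls(sample, n=3):
    """Report which X or Y members have fewer data identifiers than their checksum expects."""
    pairs = [parse(label) for label in sample]
    data = {(x, y) for x, y in pairs if x <= n and y <= n}
    short = []
    for x, y in pairs:
        if x <= n < y and sum(1 for a, _ in data if a == x) < y - (n + 1):
            short.append(f"X{x}")
        elif y <= n < x and sum(1 for _, b in data if b == y) < x - (n + 1):
            short.append(f"Y{y}")
    return sorted(short)

library = {(1, 1), (1, 3), (2, 1), (2, 2), (2, 3)}
pool = add_checksums(library)
shortfalls = find_shortfalls(pool - {"X1Y1"})      # simulate a false negative
print(sorted(pool), shortfalls)
```

When X1Y1 is dropped from the sample, both the X1 and the Y1 checksums report a shortfall, and their intersection points to X1Y1 as the missing identifier, as in the example.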

Noise in data encoding and decoding may be reduced by constructing identifiers palindromically, for example, by using palindromic pairs of components rather than single components in the product scheme. Then the pairs of components from different layers may be assembled to one another in a palindromic manner (e.g., YXY instead of XY for components X and Y). This palindromic method may be expanded to larger numbers of layers (e.g., ZYXYZ instead of XYZ) and may enable detection of erroneous cross reactions between identifiers.

Adding supplemental nucleic acid sequences in excess (e.g., vast excess) to the identifiers may prevent sequencing from recovering the encoded identifiers. Prior to decoding the information, the identifiers may be enriched from the supplemental nucleic acid sequences. For example, the identifiers may be enriched by a nucleic acid amplification reaction using primers specific to the identifier ends. Alternatively, or in addition, the information may be decoded without enriching the sample pool by sequencing (e.g., sequencing by synthesis) using a specific primer. In both decoding methods, it may be difficult to enrich or decode the information without having a decoding key or knowing something about the composition of the identifiers. Alternative access methods may also be employed such as using affinity tag based probes.

Systems for Encoding Binary Sequence Data

A system for encoding digital information into nucleic acids (e.g., DNA) can comprise systems, methods and devices for converting files and data (e.g., raw data, compressed zip files, integer data, and other forms of data) into bytes and encoding the bytes into segments or sequences of nucleic acids, typically DNA, or combinations thereof.

In an aspect, the present disclosure provides systems for encoding binary sequence data using nucleic acids. A system for encoding binary sequence data using nucleic acids may comprise a device and one or more computer processors. The device may be configured to construct an identifier library. The one or more computer processors may be individually or collectively programmed to (i) translate the information into a string of symbols, (ii) map the string of symbols to the plurality of identifiers, and (iii) construct an identifier library comprising at least a subset of a plurality of identifiers. An individual identifier of the plurality of identifiers may correspond to an individual symbol of the string of symbols. An individual identifier of the plurality of identifiers may comprise one or more components. An individual component of the one or more components may comprise a nucleic acid sequence.

In another aspect, the present disclosure provides systems for reading binary sequence data using nucleic acids. A system for reading binary sequence data using nucleic acids may comprise a database and one or more computer processors. The database may store an identifier library encoding the information. The one or more computer processors may be individually or collectively programmed to (i) identify the identifiers in the identifier library, (ii) generate a plurality of symbols from identifiers identified in (i), and (iii) compile the information from the plurality of symbols. The identifier library may comprise a subset of a plurality of identifiers. Each individual identifier of the plurality of identifiers may correspond to an individual symbol in a string of symbols. An identifier may comprise one or more components. A component may comprise a nucleic acid sequence.

Non-limiting embodiments of methods for using the system to encode digital data can comprise steps for receiving digital information in the form of byte streams, parsing the byte streams into individual bytes, mapping the location of a bit within the byte using a nucleic acid index (or identifier rank), and encoding sequences corresponding to either bit values of 1 or bit values of 0 into identifiers. Steps for retrieving digital data can comprise sequencing a nucleic acid sample or nucleic acid pool comprising sequences of nucleic acid (e.g., identifiers) that map to one or more bits, referencing an identifier rank to confirm if the identifier is present in the nucleic acid pool, and decoding the location and bit-value information for each sequence into a byte comprising a sequence of digital information.
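As a non-limiting illustration, the mapping between a byte stream and the presence or absence of identifiers may be sketched in software as follows. The sketch assumes that a bit value of 1 is represented by the presence of an identifier and a bit value of 0 by its absence, that the identifier rank is simply the position of the bit in the bit stream, and that bits are read most significant bit first. The function names `encode` and `decode` are hypothetical and chosen only for this example.

```python
from typing import Set


def encode(data: bytes) -> Set[int]:
    """Return the identifier ranks (bit positions) whose bit value is 1.

    Assumption: rank = byte_index * 8 + bit_index, most significant bit first.
    """
    present = set()
    for byte_index, byte in enumerate(data):
        for bit_index in range(8):
            if (byte >> (7 - bit_index)) & 1:
                present.add(byte_index * 8 + bit_index)
    return present


def decode(present: Set[int], n_bytes: int) -> bytes:
    """Rebuild the byte stream from the ranks of identifiers found in the pool."""
    out = bytearray(n_bytes)  # every bit starts as 0 (identifier absent)
    for rank in present:
        out[rank // 8] |= 1 << (7 - rank % 8)
    return bytes(out)


if __name__ == "__main__":
    pool = encode(b"DNA")
    print(sorted(pool))
    print(decode(pool, 3))
```

In this sketch the set returned by `encode` stands in for the identifier library, and the set passed to `decode` stands in for the identifiers detected by sequencing.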

Systems for encoding, writing, copying, accessing, reading, and decoding information encoded and written into nucleic acid molecules may be a single integrated unit or may be multiple units configured to execute one or more of the aforementioned operations. A system for encoding and writing information into nucleic acid molecules (e.g., identifiers) may include a device and one or more computer processors. The one or more computer processors may be programmed to parse the information into strings of symbols (e.g., strings of bits). The computer processor may generate an identifier rank. The computer processor may categorize the symbols into two or more categories. One category may include symbols to be represented by a presence of the corresponding identifier in the identifier library and the other category may include symbols to be represented by an absence of the corresponding identifiers in the identifier library. The computer processor may direct the device to assemble the identifiers corresponding to symbols to be represented by the presence of an identifier in the identifier library.

The device may comprise a plurality of regions, sections, or partitions. The reagents and components to assemble the identifiers may be stored in one or more regions, sections, or partitions of the device. Layers may be stored in separate regions or sections of the device. A layer may comprise one or more unique components. The components in one layer may be unique from the components in another layer. The regions or sections may comprise vessels and the partitions may comprise wells. Each layer may be stored in a separate vessel or partition. Each reagent or nucleic acid sequence may be stored in a separate vessel or partition. Alternatively, or in addition, reagents may be combined to form a master mix for identifier construction. The device may transfer reagents, components, and templates from one section of the device to be combined in another section. The device may provide the conditions for completing the assembly reaction. For example, the device may provide heating, agitation, and detection of reaction progress. The constructed identifiers may be directed to undergo one or more subsequent reactions to add barcodes, common sequences, variable sequences, or tags to one or more ends of the identifiers. The identifiers may then be directed to a region or partition to generate an identifier library. One or more identifier libraries may be stored in each region, section, or individual partition of the device. The device may transfer fluid (e.g., reagents, components, templates) using pressure, vacuum, or suction.

The identifier libraries may be stored in the device or may be moved toa separate database. The database may comprise one or more identifierlibraries. The database may provide conditions for long term storage ofthe identifier libraries (e.g., conditions to reduce degradation ofidentifiers). The identifier libraries may be stored in a powder,liquid, or solid form. Aqueous solutions of identifiers may belyophilized for more stable storage (see Chemical Methods Section G formore information about lyophilization). The database may provideUltra-Violet light protection, reduced temperature (e.g., refrigerationor freezing), and protection from degrading chemicals and enzymes. Priorto being transferred to a database, the identifier libraries may belyophilized or frozen. The identifier libraries may includeethylenediaminetetraacetic acid (EDTA) to inactivate nucleases and/or abuffer to maintain the stability of the nucleic acid molecules.

The database may be coupled to, include, or be separate from a device that writes the information into identifiers, copies the information, accesses the information, or reads the information. A portion of an identifier library may be removed from the database prior to copying, accessing or reading. The device that copies the information from the database may be the same or a different device from that which writes the information. The device that copies the information may extract an aliquot of an identifier library from the device and combine that aliquot with the reagents and constituents to amplify a portion of or the entire identifier library. The device may control the temperature, pressure, and agitation of the amplification reaction. The device may comprise partitions and one or more amplification reactions may occur in the partition comprising the identifier library. The device may copy more than one pool of identifiers at a time.

The copied identifiers may be transferred from the copy device to anaccessing device. The accessing device may be the same device as thecopy device. The access device may comprise separate regions, sections,or partitions. The access device may have one or more columns, beadreservoirs, or magnetic regions for separating identifiers bound toaffinity tags (see Chemical Methods Section F about nucleic acidcapture). Alternatively, or in addition to, the access device may haveone or more size selection units. A size selection unit may includeagarose gel electrophoresis or any other method for size selectingnucleic acid molecules (see Chemical Methods Section E for moreinformation about nucleic acid size-selection). Copying and extractionmay be performed in the same region of a device or in different regionsof a device (see Chemical Methods Section D about nucleic acidamplification).

The accessed data may be read in the same device or the accessed datamay be transferred to another device. The reading device may comprise adetection unit to detect and identify the identifiers. The detectionunit may be part of a sequencer, hybridization array, or other unit foridentifying the presence or absence of an identifier. A sequencingplatform may be designed specifically for decoding and readinginformation encoded into nucleic acid sequences. The sequencing platformmay be dedicated to sequencing single or double stranded nucleic acidmolecules. The sequencing platform may decode nucleic acid encoded databy reading individual bases (e.g., base-by-base sequencing) or bydetecting the presence or absence of an entire nucleic acid sequence(e.g., component) incorporated within the nucleic acid molecule (e.g.,identifier). Alternatively, the sequencing platform may be a system suchas Illumina® Sequencing or fragmentation analysis by capillaryelectrophoresis. Alternatively or in addition to, decoding nucleic acidsequences may be performed using a variety of analytical techniquesimplemented by the device, including but not limited to, any methodsthat generate optical, electrochemical, or chemical signals.

Information storage in nucleic acid molecules may have variousapplications including, but not limited to, long term informationstorage, sensitive information storage, and storage of medicalinformation. In an example, a person's medical information (e.g.,medical history and records) may be stored in nucleic acid molecules andcarried on his or her person. The information may be stored external tothe body (e.g., in a wearable device) or internal to the body (e.g., ina subcutaneous capsule). When a patient is brought into a medical officeor hospital, a sample may be taken from the device or capsule and theinformation may be decoded with the use of a nucleic acid sequencer.Personal storage of medical records in nucleic acid molecules mayprovide an alternative to computer and cloud based storage systems.Personal storage of medical records in nucleic acid molecules may reducethe instance or prevalence of medical records being hacked. Nucleic acidmolecules used for capsule-based storage of medical records may bederived from human genomic sequences. The use of human genomic sequencesmay decrease the immunogenicity of the nucleic acid sequences in theevent of capsule failure and leakage.

Computer Systems

The present disclosure provides computer systems that are programmed toimplement methods of the disclosure. FIG. 19 shows a computer system1901 that is programmed or otherwise configured to encode digitalinformation into nucleic acid sequences and/or read (e.g., decode)information derived from nucleic acid sequences. The computer system1901 can regulate various aspects of the encoding and decodingprocedures of the present disclosure, such as, for example, thebit-values and bit location information for a given bit or byte from anencoded bitstream or byte stream.

The computer system 1901 includes a central processing unit (CPU, also“processor” and “computer processor” herein) 1905, which can be a singlecore or multi core processor, or a plurality of processors for parallelprocessing. The computer system 1901 also includes memory or memorylocation 1910 (e.g., random-access memory, read-only memory, flashmemory), electronic storage unit 1915 (e.g., hard disk), communicationinterface 1920 (e.g., network adapter) for communicating with one ormore other systems, and peripheral devices 1925, such as cache, othermemory, data storage and/or electronic display adapters. The memory1910, storage unit 1915, interface 1920 and peripheral devices 1925 arein communication with the CPU 1905 through a communication bus (solidlines), such as a motherboard. The storage unit 1915 can be a datastorage unit (or data repository) for storing data. The computer system1901 can be operatively coupled to a computer network (“network”) 1930with the aid of the communication interface 1920. The network 1930 canbe the Internet, an internet and/or extranet, or an intranet and/orextranet that is in communication with the Internet. The network 1930 insome cases is a telecommunication and/or data network. The network 1930can include one or more computer servers, which can enable distributedcomputing, such as cloud computing. The network 1930, in some cases withthe aid of the computer system 1901, can implement a peer-to-peernetwork, which may enable devices coupled to the computer system 1901 tobehave as a client or a server.

The CPU 1905 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 1910. The instructions can be directed to the CPU 1905, which can subsequently program or otherwise configure the CPU 1905 to implement methods of the present disclosure. Examples of operations performed by the CPU 1905 can include fetch, decode, execute, and writeback.

The CPU 1905 can be part of a circuit, such as an integrated circuit. One or more other components of the system 1901 can be included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC).

The storage unit 1915 can store files, such as drivers, libraries and saved programs. The storage unit 1915 can store user data, e.g., user preferences and user programs. The computer system 1901 in some cases can include one or more additional data storage units that are external to the computer system 1901, such as located on a remote server that is in communication with the computer system 1901 through an intranet or the Internet.

The computer system 1901 can communicate with one or more remote computer systems through the network 1930. For instance, the computer system 1901 can communicate with a remote computer system of a user or other devices and/or machinery that may be used by the user in the course of analyzing data encoded or decoded in a sequence of nucleic acids (e.g., a sequencer or other system for chemically determining the order of nitrogenous bases in a nucleic acid sequence). Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PCs (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, Smart phones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants. The user can access the computer system 1901 via the network 1930.

Methods as described herein can be implemented by way of machine (e.g.,computer processor) executable code stored on an electronic storagelocation of the computer system 1901, such as, for example, on thememory 1910 or electronic storage unit 1915. The machine executable ormachine readable code can be provided in the form of software. Duringuse, the code can be executed by the processor 1905. In some cases, thecode can be retrieved from the storage unit 1915 and stored on thememory 1910 for ready access by the processor 1905. In some situations,the electronic storage unit 1915 can be precluded, andmachine-executable instructions are stored on memory 1910.

The code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.

Aspects of the systems and methods provided herein, such as the computer system 1901, can be embodied in programming. Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.

Hence, a machine readable medium, such as computer-executable code, may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.

The computer system 1901 can include or be in communication with anelectronic display 1935 that comprises a user interface (UI) 1940 forproviding, for example, sequence output data including chromatographs,sequences as well as bits, bytes, or bit streams encoded by or read by amachine or computer system that is encoding or decoding nucleic acids,raw data, files and compressed or decompressed zip files to be encodedor decoded into DNA stored data. Examples of UI's include, withoutlimitation, a graphical user interface (GUI) and web-based userinterface.

Methods and systems of the present disclosure can be implemented by way of one or more algorithms. An algorithm can be implemented by way of software upon execution by the central processing unit 1905. The algorithm can, for example, be used with a DNA index and raw data or zip file compressed or decompressed data, to determine a customized method for coding digital information from the raw data or zip file compressed data, prior to encoding the digital information.

Chemical Methods Section A. Overlap Extension PCR (OEPCR) Assembly

In OEPCR, components are assembled in a reaction comprising polymerase and dNTPs (deoxynucleotide triphosphates comprising dATP, dTTP, dCTP, dGTP or variants or analogs thereof). Components can be single stranded or double stranded nucleic acids. Components to be assembled adjacent to each other may have complementary 3′ ends, complementary 5′ ends, or homology between one component's 5′ end and the adjacent component's 3′ end. These end regions, termed “hybridization regions”, are intended to facilitate the formation of hybridized junctions between the components during OEPCR, wherein the 3′ end of one input component (or the complement thereof) is hybridized to the 3′ end of its intended adjacent component (or the complement thereof). An assembled double-stranded product is then formed by polymerase extension. This product may then be assembled to more components through subsequent hybridization and extension. FIG. 7 illustrates an example schematic of OEPCR for assembling three nucleic acids.

In some embodiments, the OEPCR may comprise cycling between threetemperatures: a melting temperature, an annealing temperature, and anextension temperature. The melting temperature is intended to turndouble stranded nucleic acids into single stranded nucleic acids, aswell as remove the formation of secondary structures or hybridizationswithin a component or between components. Typically the meltingtemperature is high, for example above 95 degrees Celsius. In someembodiments the melting temperature may be at least 96, 97, 98, 99, 100,101, 102, 103, 104, or 105 degrees Celsius. In other embodiments themelting temperature may be at most 95, 94, 93, 92, 91, or 90 degreesCelsius. A higher melting temperature will improve dissociation ofnucleic acids and their secondary structures, but may also cause sideeffects such as the degradation of nucleic acids or the polymerase.Melting temperatures may be applied to the reaction for at least 1, 2,3, 4, 5 seconds, or above, such as 30 seconds, 1 minute, 2 minutes, or 3minutes.

The annealing temperature is intended to facilitate the formation ofhybridization between complementary 3′ ends of intended adjacentcomponents (or their complements). In some embodiments, the annealingtemperature may match the calculated melting temperature of the intendedhybridized nucleic acid formation. In other embodiments, the annealingtemperature may be within 10 degrees Celsius or more of said meltingtemperature. In some embodiments, the annealing temperature may be atleast 25, 30, 50, 55, 60, 65, or 70 degrees Celsius. The meltingtemperature may depend on the sequence of the intended hybridizationregion between components. Longer hybridization regions have highermelting temperatures, and hybridization regions with higher percentcontent of Guanine or Cytosine nucleotides may have higher meltingtemperatures. It may therefore be possible to design components forOEPCR reactions intended to assemble optimally at particular annealingtemperatures. Annealing temperatures may be applied to the reaction forat least 1, 5, 10, 15, 20, 25, or 30 seconds, or above.
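As a non-limiting illustration of how hybridization region length and Guanine or Cytosine content may influence the calculated melting temperature, the following sketch uses the Wallace rule, a common rule of thumb for short oligonucleotides that assigns 2 degrees Celsius per A or T base and 4 degrees Celsius per G or C base. The choice of this rule is an assumption made for the example; the disclosure does not prescribe a particular formula, and the function name `wallace_tm` is hypothetical.

```python
def wallace_tm(hybridization_region: str) -> int:
    """Estimate melting temperature (degrees Celsius) with the Wallace rule.

    Assumption: Tm = 2 * (A + T) + 4 * (G + C), a rough approximation
    generally applied only to short oligonucleotides.
    """
    seq = hybridization_region.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    # Same length, different GC content.
    print(wallace_tm("ATATATATAT"), wallace_tm("GCGCGCGCGC"))
    # Same composition pattern, longer region.
    print(wallace_tm("ATATATATATAT"))
```

Under this approximation, a longer region or a region richer in G and C yields a higher estimate, consistent with the trends described above.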

The extension temperature is intended to initiate and facilitate the nucleic acid chain elongation of hybridized 3′ ends catalyzed by one or more polymerase enzymes. In some embodiments, the extension temperature may be set at the temperature in which the polymerase functions optimally in terms of nucleic acid binding strength, elongation speed, elongation stability, or fidelity. In some embodiments, the extension temperature may be at least 30, 40, 50, 60, or 70 degrees Celsius, or above. Extension temperatures may be applied to the reaction for at least 1, 5, 10, 15, 20, 25, 30, 40, 50, or 60 seconds or above. Recommended extension times may be around 15 to 45 seconds per kilobase of expected elongation.
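As a non-limiting illustration, the recommended range of about 15 to 45 seconds per kilobase may be converted into an extension time window for a given expected elongation length. The function name `extension_time_range` and its parameter names are hypothetical.

```python
from typing import Tuple


def extension_time_range(
    expected_elongation_bases: int,
    seconds_per_kb_low: float = 15.0,
    seconds_per_kb_high: float = 45.0,
) -> Tuple[float, float]:
    """Return (shortest, longest) suggested extension time in seconds.

    The default rates are the 15 to 45 seconds per kilobase given in the text.
    """
    kilobases = expected_elongation_bases / 1000
    return (seconds_per_kb_low * kilobases, seconds_per_kb_high * kilobases)


if __name__ == "__main__":
    print(extension_time_range(2000))
```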

In some embodiments of OEPCR, the annealing temperature and the extension temperature may be the same. Thus a 2-step temperature cycle may be used instead of a 3-step temperature cycle. Examples of combined annealing and extension temperatures include 60, 65, or 72 degrees Celsius.

In some embodiments, OEPCR may be performed with one temperature cycle. Such embodiments may involve the intended assembly of just two components. In other embodiments, OEPCR may be performed with multiple temperature cycles. Any given nucleic acid in OEPCR may only assemble to at most one other nucleic acid in one cycle. This is because assembly (or extension or elongation) may only occur at the 3′ end of a nucleic acid and each nucleic acid may only have one 3′ end. Therefore, the assembly of multiple components may require multiple temperature cycles. For example, assembling four components may involve 3 temperature cycles. Assembling 6 components may involve 5 temperature cycles. Assembling 10 components may involve 9 temperature cycles. In some embodiments, using more temperature cycles than the minimum required may increase assembly efficiency. For example, using four temperature cycles to assemble two components may yield more product than only using one temperature cycle. This is because the hybridization and elongation of components is a statistical event that occurs with a fraction of the total number of components in each cycle. So the total fraction of assembled components may increase with increased cycles.
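As a non-limiting illustration, the cycle counts in the examples above follow the pattern of one fewer cycle than the number of components, and the benefit of additional cycles may be sketched with a simple probabilistic model. The model assumes that, for a two-component assembly, each cycle independently assembles a fixed fraction `p` of the components that remain unassembled; this independence assumption and the names `min_cycles` and `assembled_fraction` are introduced only for this example.

```python
def min_cycles(n_components: int) -> int:
    """Cycle count matching the examples in the text: one fewer than the component count."""
    if n_components < 2:
        raise ValueError("assembly needs at least two components")
    return n_components - 1


def assembled_fraction(p: float, cycles: int) -> float:
    """Fraction assembled after `cycles` cycles in a two-component reaction.

    Assumption: each cycle assembles a fraction p of what is still unassembled,
    so the unassembled fraction shrinks by (1 - p) per cycle.
    """
    return 1.0 - (1.0 - p) ** cycles


if __name__ == "__main__":
    print([min_cycles(n) for n in (4, 6, 10)])
    print(assembled_fraction(0.5, 1), assembled_fraction(0.5, 4))
```

Under this simplified model, the assembled fraction rises with each added cycle but never exceeds 1, which is one way to read the statement that more cycles may yield more product.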

In addition to temperature cycling considerations, the design of the nucleic acid sequences in OEPCR may influence the efficiency of their assembly to one another. Nucleic acids with long hybridization regions may hybridize more efficiently at a given annealing temperature compared with nucleic acids with short hybridization regions. This is because a longer hybridized product contains a larger number of stable base-pairs and may therefore be a more stable overall hybridized product than a shorter hybridized product. Hybridization regions may have a length of at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more bases.

Hybridization regions with high guanine or cytosine content may hybridize more efficiently at a given temperature than hybridization regions with low guanine or cytosine content. This is because guanine forms a more stable base-pair with cytosine than adenine does with thymine. Hybridization regions may have a guanine or cytosine content (also known as GC content) of anywhere between 0% and 100%.
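As a non-limiting illustration, GC content may be computed as the fraction of bases in a hybridization region that are guanine or cytosine. The function name `gc_content` is hypothetical.

```python
def gc_content(seq: str) -> float:
    """Return the fraction (0.0 to 1.0) of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    for region in ("ATAT", "ATGC", "GGCC"):
        print(region, f"{gc_content(region):.0%}")
```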

In addition to hybridization region length and GC content, there are many more aspects of the nucleic acid sequence design that may affect the efficiency of the OEPCR. For example, the formation of undesired secondary structures within a component may interfere with its ability to form a hybridization product with its intended adjacent component. These secondary structures may include hairpin loops. The types of possible secondary structures and their stability (for example melting temperature) for a nucleic acid may be predicted based on the sequence. Design space search algorithms may be used to determine nucleic acid sequences that meet proper length and GC content criteria for efficient OEPCR, while avoiding sequences with potentially inhibitory secondary structures. Design space search algorithms may include genetic algorithms, heuristic search algorithms, meta-heuristic search strategies like tabu search, branch-and-bound search algorithms, dynamic programming-based algorithms, constrained combinatorial optimization algorithms, gradient descent-based algorithms, randomized search algorithms, or combinations thereof.
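As a non-limiting illustration, a randomized search, one of the design space search strategies listed above, may be sketched as follows. The sketch draws random candidate sequences and keeps the first one whose GC content falls in a target window and that contains no simple hairpin stem. The hairpin test here is a deliberately crude stand-in for a thermodynamic secondary structure prediction: it only looks for a short window whose reverse complement occurs further along the same strand. All function names, the stem length, the minimum loop size, and the GC window are assumptions made for this example.

```python
import random
from typing import Optional

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def gc_fraction(seq: str) -> float:
    return (seq.count("G") + seq.count("C")) / len(seq)


def has_hairpin_stem(seq: str, stem: int = 4, min_loop: int = 3) -> bool:
    """True if a stem-length window can pair with a downstream window on the same strand."""
    for i in range(len(seq) - stem + 1):
        partner = reverse_complement(seq[i:i + stem])
        # The partner must start at least min_loop bases after the window ends.
        if seq.find(partner, i + stem + min_loop) != -1:
            return True
    return False


def random_search(length: int, gc_min: float, gc_max: float,
                  rng: random.Random, max_tries: int = 10000) -> Optional[str]:
    """Return the first random candidate meeting the criteria, or None if none is found."""
    for _ in range(max_tries):
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        if gc_min <= gc_fraction(candidate) <= gc_max and not has_hairpin_stem(candidate):
            return candidate
    return None


if __name__ == "__main__":
    print(random_search(12, 0.4, 0.6, random.Random(0)))
```

A practical design tool would likely replace `has_hairpin_stem` with a nearest-neighbor energy model and might use one of the other listed search strategies when the constraints are tight.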

Likewise, the formation of homodimers (nucleic acid molecules that hybridize with nucleic acid molecules of the same sequence) and unwanted heterodimers (nucleic acid sequences that hybridize with other nucleic acid sequences aside from their intended assembly partner) may interfere with OEPCR. Similar to secondary structures within a nucleic acid, the formation of homodimers and heterodimers may be predicted and accounted for during nucleic acid design using computational methods and design space search algorithms.
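As a non-limiting illustration, a simple computational screen for potential homodimers and heterodimers may look for the longest stretch of one sequence whose reverse complement appears in another sequence. This is a sequence-matching heuristic rather than a thermodynamic prediction, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def max_complementary_run(a: str, b: str) -> int:
    """Length of the longest substring of `a` whose reverse complement occurs in `b`.

    Passing the same sequence for both arguments screens for homodimers;
    passing two different sequences screens for heterodimers.
    """
    for k in range(min(len(a), len(b)), 0, -1):
        for i in range(len(a) - k + 1):
            if reverse_complement(a[i:i + k]) in b:
                return k
    return 0


if __name__ == "__main__":
    print(max_complementary_run("ACGT", "ACGT"))          # self-complementary sequence
    print(max_complementary_run("AAAACCCC", "GGGGTTTT"))  # fully complementary pair
    print(max_complementary_run("AAAA", "AAAA"))          # no complementarity
```

A design routine could reject candidate sequences for which this value, computed against any unintended partner, exceeds a chosen threshold.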

Longer nucleic acid sequences or higher GC content may create increasedformation of unwanted secondary structures, homodimers, and heterodimerswith the OEPCR. Therefore, in some embodiments, the use of shorternucleic acid sequences or lower GC content may lead to higher assemblyefficiency. These design principles may counteract the design strategiesof using long hybridization regions or high GC content for moreefficient assembly. As such, in some embodiments, OEPCR may be optimizedby using long hybridization regions with high GC content but shortnon-hybridization regions with low GC content. The overall length ofnucleic acids may be at least 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100bases, or above. In some embodiments, there may be an optimal length andoptimal GC content for the hybridization regions of nucleic acids wherethe assembly efficiency is optimized.

A larger number of distinct nucleic acids in an OEPCR reaction may interfere with the expected assembly efficiency. This is because a larger number of distinct nucleic acid sequences may create a higher probability for undesirable molecular interactions, particularly in the form of heterodimers. Therefore, in some embodiments of OEPCR that assemble large numbers of components, nucleic acid sequence constraints may become more stringent for efficient assembly.

Primers for amplifying the anticipated final assembled product may be included in an OEPCR reaction. The OEPCR reaction may then be performed with more temperature cycles to improve the yield of the assembled product, not just by creating more assemblies between the constituent components, but also by exponentially amplifying the full assembled product in the manner of conventional PCR (see Chemical Methods Section D).

Additives may be included in the OEPCR reaction to improve assembly efficiency. Examples of additives include Betaine, Dimethyl sulfoxide (DMSO), non-ionic detergents, Formamide, Magnesium, Bovine Serum Albumin (BSA), or combinations thereof. Additive content (weight per volume) may be at least 0%, 1%, 5%, 10%, 20%, or more.

Various polymerases may be used for OEPCR. The polymerase can be naturally occurring or synthesized. An example polymerase is a Φ29 polymerase or derivative thereof. In some cases, a transcriptase or a ligase is used (i.e., enzymes which catalyze the formation of a bond) in conjunction with polymerases or as an alternative to polymerases to construct new nucleic acid sequences. Examples of polymerases include a DNA polymerase, a RNA polymerase, a thermostable polymerase, a wild-type polymerase, a modified polymerase, E. coli DNA polymerase I, T7 DNA polymerase, bacteriophage T4 DNA polymerase, Φ29 (phi29) DNA polymerase, Taq polymerase, Tth polymerase, Tli polymerase, Pfu polymerase, Pwo polymerase, VENT polymerase, DEEPVENT polymerase, Ex-Taq polymerase, LA-Taq polymerase, Sso polymerase, Poc polymerase, Pab polymerase, Mth polymerase, ES4 polymerase, Tru polymerase, Tac polymerase, Tne polymerase, Tma polymerase, Tca polymerase, Tih polymerase, Tfi polymerase, Platinum Taq polymerases, Tbr polymerase, Phusion polymerase, KAPA polymerase, Q5 polymerase, Tfl polymerase, Pfuturbo polymerase, Pyrobest polymerase, KOD polymerase, Bst polymerase, Sac polymerase, Klenow fragment polymerase with 3′ to 5′ exonuclease activity, and variants, modified products and derivatives thereof. Different polymerases may be stable and function optimally at different temperatures. Moreover, different polymerases have different properties. For example, some polymerases, such as Phusion polymerase, may exhibit 3′ to 5′ exonuclease activity, which may contribute to higher fidelity during nucleic acid elongation. Some polymerases may displace leading sequences during elongation, while others may degrade them or halt elongation. Some polymerases, like Taq, incorporate an adenine base at the 3′ end of nucleic acid sequences. This process is referred to as A-tailing and may be inhibitory to OEPCR as the addition of an Adenine base may disrupt the designed 3′ complementarity between intended adjacent components.

OEPCR may also be referred to as polymerase cycling assembly (or PCA).

B. Ligation Assembly

In ligation assembly, separate nucleic acids are assembled in a reactioncomprising one or more ligase enzymes and additional co-factors.Co-factors may include Adenosine Tri-Phosphate (ATP), Dithiothreitol(DTT), or Magnesium ion (Mg2+). During ligation, the 3′-end of onenucleic acid strand is covalently linked to the 5′ end of anothernucleic acid strand, thus forming an assembled nucleic acid. Componentsin a ligation reaction may be blunt-ended double stranded DNA (dsDNA),single stranded DNA (ssDNA), or partially hybridized single-strandedDNA. Strategies that bring the ends of nucleic acids together increasethe frequency of viable substrate for ligase enzymes, and thus may beused for improving the efficiency of ligase reactions. Blunt-ended dsDNAmolecules tend to form hydrophobic stacks on which ligase enzymes mayact, but a more successful strategy for bringing nucleic acids togethermay be to use nucleic acid components with either 5′ or 3′single-stranded overhangs that have complementarity for the overhangs ofcomponents to which they are intended to assemble. In the latterinstance, more stable nucleic acid duplexes may form due to base-basehybridization.

When a double stranded nucleic acid has an overhang strand on one end, the other strand on the same end may be referred to as a “cavity”. Together, a cavity and overhang form a “sticky end”, also known as a “cohesive-end”. A sticky end may be either a 3′ overhang and a 5′ cavity, or a 5′ overhang and a 3′ cavity. The sticky-ends between two intended adjacent components may be designed to have complementarity such that the overhangs of both sticky ends hybridize and each overhang ends directly adjacent to the beginning of the cavity on the other component. This forms a “nick” (a break in one strand of double-stranded DNA) that may be “sealed” (covalently linked through a phosphodiester bond) by the action of a ligase. See FIG. 8 for an example schematic of sticky end ligation for assembling three nucleic acids. Either the nick on one strand or the other, or both, may be sealed. Thermodynamically, the top and bottom strand of a molecule that forms a sticky end may move between associated and dissociated states, and therefore the sticky end may be a transient formation. Once, however, the nick along one strand of a sticky end duplex between two components is sealed, that covalent linkage remains even if the members of the opposite strand dissociate. The linked strand may then become a template to which the intended adjacent members of the opposite strand can bind and once again form a nick that may be sealed.

Sticky ends may be created by digesting dsDNA with one or more endonucleases. Endonucleases (that may be referred to as restriction enzymes) may target specific sites (that may be referred to as restriction sites) on either or both ends of a dsDNA molecule, and create a staggered cleavage (sometimes referred to as a digestion) thus leaving a sticky end. See Chemical Methods Section C on restriction digests. The digest may leave a palindromic overhang (an overhang with a sequence that is the reverse complement of itself). If so, then two components digested with the same endonuclease may form complementary sticky ends along which they may be assembled with a ligase. The digestion and ligation may occur together in the same reaction if the endonuclease and ligase are compatible. The reaction may occur at a uniform temperature, such as 4, 10, 16, 25, or 37 degrees Celsius. Or the reaction may cycle between multiple temperatures, such as between 16 degrees Celsius and 37 degrees Celsius. Cycling between multiple temperatures may enable the digestion and ligation to each proceed at their respective optimal temperatures during different parts of the cycle.
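The palindromic-overhang condition described above can be checked computationally. The following is a minimal sketch, assuming sequences are given as 5′-to-3′ strings over A, C, G, and T; the function names are hypothetical and are not part of the disclosure.

```python
# Translation table mapping each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a 5'-to-3' DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_palindromic(seq: str) -> bool:
    """True if the sequence equals its own reverse complement."""
    s = seq.upper()
    return s == reverse_complement(s)


if __name__ == "__main__":
    # AATT is the 5' overhang left by EcoRI (G^AATTC).
    for overhang in ("AATT", "GATC", "ACGG"):
        print(overhang, is_palindromic(overhang))
```

An overhang for which `is_palindromic` returns true can hybridize to another copy of itself, which is why two components cut with the same such endonuclease may be joined, and also why a single component may self-ligate.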

It may be beneficial to perform the digestion and ligation in separate reactions, for example, if the desired ligases and the desired endonucleases function optimally at different conditions, or if the ligated product forms a new restriction site for the endonuclease. In these instances, it may be better to perform the restriction digest and then the ligation separately, and it may be further beneficial to remove the restriction enzyme prior to ligation. Nucleic acids may be separated from enzymes through phenol-chloroform extraction, ethanol precipitation, magnetic bead capture, and/or silica membrane adsorption, washing, and elution. Multiple endonucleases may be used in the same reaction, though care should be taken to ensure that the endonucleases do not interfere with each other and function under similar reaction conditions. Using two endonucleases, one may create orthogonal (non-complementary) sticky ends on both ends of a dsDNA component.

Endonuclease digestion can leave sticky ends with phosphorylated 5′ ends. Ligases may only function on phosphorylated 5′ ends, and not on non-phosphorylated 5′ ends. As such, there may not be any need for an intermediate 5′ phosphorylation step in between digestion and ligation. A digested dsDNA component with a palindromic overhang on its sticky end may ligate to itself. To prevent self-ligation, it may be beneficial to dephosphorylate said dsDNA component prior to ligation.

Multiple endonucleases may target different restriction sites, but leave compatible overhangs (overhangs that are the reverse complement of each other). The product of ligation of sticky ends created with two such endonucleases may result in an assembled product that does not contain a restriction site for either endonuclease at the site of ligation. Such endonucleases form the basis of assembly methods, such as BioBricks assembly, that may programmably assemble multiple components using just two endonucleases by performing repetitive digestion-ligation cycles. FIG. 20 illustrates an example of a digestion-ligation cycle using endonucleases BamHI and BglII with compatible overhangs.
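The loss of both restriction sites at the ligation junction can be illustrated with a short sketch. It assumes the commonly cited recognition sites for BamHI (G^GATCC) and BglII (A^GATCT), both of which leave a 5′-GATC overhang, and it models only the top strand of each component. The flanking sequences and the helper name `cut_top_strand` are hypothetical.

```python
BAMHI = "GGATCC"  # cut after the first base: G^GATCC
BGLII = "AGATCT"  # cut after the first base: A^GATCT


def cut_top_strand(seq: str, site: str, offset: int = 1) -> tuple[str, str]:
    """Split the top strand at the first occurrence of `site`.

    `offset` is the number of bases of the site kept on the left fragment.
    """
    i = seq.index(site)
    return seq[: i + offset], seq[i + offset:]


# Two hypothetical components, each carrying one restriction site.
part_a = "TTACG" + BAMHI + "CATTA"
part_b = "CCGTA" + BGLII + "TGGAC"

left_a, _ = cut_top_strand(part_a, BAMHI)   # ends in ...G
_, right_b = cut_top_strand(part_b, BGLII)  # begins with GATCT...

# Ligating the BamHI-cut end to the BglII-cut end joins G + GATCT.
product = left_a + right_b

print(product)
print("BamHI site present:", BAMHI in product)
print("BglII site present:", BGLII in product)
```

The junction reads GGATCT, which neither enzyme recognizes, so the assembled product may be carried into a further digestion-ligation cycle without being recut at that junction.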

In some embodiments, the endonucleases used to create sticky ends may be type IIS restriction enzymes. These enzymes cleave a fixed number of bases away from their restriction sites in a particular direction, therefore the sequence of the overhangs that they generate may be customized. The overhang sequences need not be palindromic. The same type IIS restriction enzyme may be used to create multiple different sticky ends in the same reaction, or in multiple reactions. Moreover, one or multiple type IIS restriction enzymes may be used to create components with compatible overhangs in the same reaction, or in multiple reactions. The ligation site between two sticky ends generated by type IIS restriction enzymes may be designed such that it does not form a new restriction site. In addition, the type IIS restriction enzyme sites may be placed on a dsDNA such that the restriction enzyme cleaves off its own restriction site when it generates a component with a sticky end. Therefore the ligation product between multiple components generated from type IIS restriction enzymes may not contain any restriction sites.

Type IIS restriction enzymes may be mixed in a reaction together with ligase to perform the component digestion and ligation together. The temperature of the reaction may be cycled between two or more values to promote optimal digestion and ligation. For example, the digestion may be performed optimally at 37 degrees Celsius and the ligation may be performed optimally at 16 degrees Celsius. More generally, the reaction may cycle between temperature values of at least 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, or 65 degrees Celsius or above. A combined digestion and ligation reaction may be used to assemble at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 components, or more. Examples of assembly reactions that leverage Type IIS restriction enzymes to create sticky ends include Golden Gate Assembly (also known as Golden Gate Cloning) or Modular Cloning (also known as MoClo).

In some embodiments of ligation, exonucleases may be used to create components with sticky ends. 3′ exonucleases may be used to chew back the 3′ ends from dsDNA, thus creating 5′ overhangs. Likewise, 5′ exonucleases may be used to chew back the 5′ ends from dsDNA thus creating 3′ overhangs. Different exonucleases may have different properties. For example, exonucleases may differ in the direction of their nuclease activity (5′ to 3′ or 3′ to 5′), whether or not they act on ssDNA, whether they act on phosphorylated or non-phosphorylated 5′ ends, whether or not they are able to initiate on a nick, or whether or not they are able to initiate their activity on 5′ cavities, 3′ cavities, 5′ overhangs, or 3′ overhangs. Different types of exonucleases include Lambda exonuclease, RecJf, Exonuclease III, Exonuclease I, Exonuclease T, Exonuclease V, Exonuclease VIII, Exonuclease VII, Nuclease BAL-31, T5 Exonuclease, and T7 Exonuclease.

Exonuclease may be used in a reaction together with ligase to assemble multiple components. The reaction may occur at a fixed temperature or cycle between multiple temperatures, each ideal for the ligase or the exonuclease, respectively. Polymerase may be included in an assembly reaction with ligase and a 5′-to-3′ exonuclease. The components in such a reaction may be designed such that components intended to assemble adjacent to each other share homologous sequences on their edges. For example, a component X to be assembled with component Y may have a 3′ edge sequence of the form 5′-z-3′, and the component Y may have a 5′ edge sequence of the form 5′-z-3′, where z is any nucleic acid sequence. Homologous edge sequences of such a form can be referred to as ‘Gibson overlaps’. As the 5′ exonuclease chews back the 5′ end of dsDNA components with Gibson overlaps it creates compatible 3′ overhangs that hybridize to each other. The hybridized 3′ ends may then be extended by the action of polymerase to the end of the template component, or to the point where the extended 3′ overhang of one component meets the 5′ cavity of the adjacent component, thereby forming a nick that may be sealed by a ligase. Such an assembly reaction where polymerase, ligase, and exonuclease are used together is often referred to as “Gibson assembly”. Gibson assembly may be performed by using T5 exonuclease, Phusion polymerase, and Taq ligase, and incubating the reaction at 50 degrees Celsius. In said instance, the use of the thermophilic ligase, Taq, enables the reaction to proceed at 50 degrees Celsius, a temperature suitable for all three types of enzymes in the reaction.
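The relationship between components X and Y that share an edge sequence z can be sketched in code. This is an illustrative model only: it treats each component as its top-strand 5′-to-3′ string, finds the longest suffix of X that equals a prefix of Y, and returns the sequence expected after assembly. The function names, the example sequences, and the default minimum overlap of 20 bases are assumptions for illustration.

```python
def shared_overlap(x: str, y: str, min_len: int = 1) -> str:
    """Return the longest suffix of x that is also a prefix of y."""
    for n in range(min(len(x), len(y)), min_len - 1, -1):
        if x[-n:] == y[:n]:
            return x[-n:]
    return ""


def gibson_join(x: str, y: str, min_len: int = 20) -> str:
    """Return the assembled top strand, keeping the overlap z only once."""
    z = shared_overlap(x, y, min_len)
    if not z:
        raise ValueError("components do not share a long enough overlap")
    return x + y[len(z):]


# Hypothetical 20-base overlap z shared by the 3' edge of X and 5' edge of Y.
z = "CTGACGTTAGCCATGGTACA"
component_x = "ATGCCGTA" + z
component_y = z + "TTGACCGA"

print(shared_overlap(component_x, component_y))
print(gibson_join(component_x, component_y))
```

The sketch does not model exonuclease chew-back, polymerase extension, or nick sealing; it only shows the sequence-level outcome that those enzymatic steps are intended to produce.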

The term “Gibson assembly” may generally refer to any assembly reaction involving polymerase, ligase, and exonuclease. Gibson assembly may be used to assemble at least 2, 3, 4, 6, 7, 8, 9, 10, or more components. Gibson assembly may occur as a one-step, isothermal reaction or as a multi-step reaction with one or more temperature incubations. For example, Gibson assembly may occur at temperatures of at least 30, 40, 50, 60, or 70 degrees Celsius, or less. The incubation time for a Gibson assembly may be at least 1, 5, 10, 20, 40, or 80 minutes.

Gibson assembly reactions may occur optimally when Gibson overlaps between intended adjacent components are a certain length and have favorable sequence features, such as sequences that avoid undesirable hybridization events such as hairpins, homodimers, or unwanted heterodimers. Generally, Gibson overlaps of at least 20 bases are recommended. But Gibson overlaps may be at least 1, 2, 3, 5, 10, 20, 30, 40, 50, 60, 100, or more bases in length. The GC content of a Gibson overlap may be anywhere from 0% to 100%.

Though Gibson assembly is commonly described with a 5′ exonuclease, the reaction may also occur with a 3′ exonuclease. As the 3′ exonuclease chews back the 3′ end of dsDNA components, the polymerase counteracts the action by extending the 3′ end. This dynamic process may continue until the 5′ overhangs (created by the exonuclease) of two components (that share a Gibson overlap) hybridize and the polymerase extends the 3′ end of one component far enough to meet the 5′ end of its adjacent component, thus leaving a nick that may be sealed by a ligase.

In some embodiments of ligation, components with sticky ends may be created synthetically, as opposed to enzymatically, by mixing together two single stranded nucleic acids, or oligos, that do not share full complementarity. For example, two oligos, oligo X and oligo Y, may be designed to only fully hybridize along a contiguous string of complementary bases that form a substring of a larger string of bases that make up the entirety of either one or both oligos. This complementary string of bases is referred to as the “index region”. If the index region occupies the entirety of oligo X and only the 5′ end of oligo Y, then the oligos together form a component with a blunt end on one side and a sticky end on the other with a 3′ overhang from oligo Y (FIG. 21A). If the index region occupies the entirety of oligo X and only the 3′ end of oligo Y, then the oligos together form a component with a blunt end on one side and a sticky end on the other with a 5′ overhang from oligo Y (FIG. 21B). If the index region occupies the entirety of oligo X and neither end of oligo Y (implying that the index region is embedded within the middle of oligo Y), then the oligos together form a component with a sticky end on one side with a 3′ overhang from oligo Y and on the other side with a 5′ overhang from oligo Y (FIG. 21C). If the index region occupies only the 5′ end of oligo X and only the 5′ end of oligo Y, then the oligos together form a component with a sticky end on one side with a 3′ overhang from oligo Y and on the other side with a 3′ overhang from oligo X (FIG. 21D). If the index region occupies only the 3′ end of oligo X and only the 3′ end of oligo Y, then the oligos together form a component with a sticky end on one side with a 5′ overhang from oligo Y and on the other side with a 5′ overhang from oligo X (FIG. 21E). In the aforementioned examples, the sequences of the overhangs are defined by the oligo sequences outside of the index region. These overhang sequences may be referred to as hybridization regions as they are the regions along which components hybridize for ligation.

The index region and hybridization region(s) of oligos in sticky-end ligation may be designed to facilitate the proper assembly of components. Components with long overhangs may hybridize more efficiently with each other at a given annealing temperature compared with components with short overhangs. Overhangs may have a length of at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, or more bases.

Components with overhangs that contain high guanine or cytosine content may hybridize more efficiently to their complementary component at a given temperature than components with overhangs that contain low guanine or cytosine content. This is because guanine forms a more stable base-pair with cytosine than adenine does with thymine. Overhangs may have a guanine or cytosine content (also known as GC content) of anywhere between 0% and 100%.

As with overhang sequences, the GC content and length of the index region of an oligo may also affect ligation efficiency. This is because sticky-end components may assemble more efficiently if the top and bottom strand of each component are stably bound. Therefore, index regions may be designed with higher GC content, longer sequences, and other features that promote higher melting temperatures. However, there are many more aspects of the oligo design, for both the index region and overhang sequence(s), that may affect the efficiency of the ligation assembly. For example, the formation of undesired secondary structures within a component may interfere with its ability to form an assembled product with its intended adjacent component. This may occur due to secondary structures in the index region, in the overhang sequence, or in both. These secondary structures may include hairpin loops. The types of possible secondary structures and their stability (for example, melting temperature) for an oligo may be predicted based on the sequence. Design space search algorithms may be used to determine oligo sequences that meet proper length and GC content criteria for the formation of effective components, while avoiding sequences with potentially inhibitory secondary structures. Design space search algorithms may include genetic algorithms, heuristic search algorithms, meta-heuristic search strategies like tabu search, branch-and-bound search algorithms, dynamic programming-based algorithms, constrained combinatorial optimization algorithms, gradient descent-based algorithms, randomized search algorithms, or combinations thereof.
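The design criteria above (length, GC content, melting temperature, and avoidance of hairpins) can be expressed as simple screening functions that a design space search might call on each candidate. The sketch below is illustrative only. It assumes the Wallace rule (2 degrees per A or T, 4 degrees per G or C) as a rough melting temperature estimate for short oligos, and it uses a crude hairpin test that looks for a stem whose reverse complement appears downstream after a minimum loop. All function names and thresholds are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


def wallace_tm(seq: str) -> int:
    """Rough melting temperature (degrees Celsius) by the Wallace rule."""
    s = seq.upper()
    at = s.count("A") + s.count("T")
    gc = s.count("G") + s.count("C")
    return 2 * at + 4 * gc


def has_hairpin(seq: str, stem: int = 4, min_loop: int = 3) -> bool:
    """True if some `stem`-base window can pair with a downstream window."""
    s = seq.upper()
    for i in range(len(s) - 2 * stem - min_loop + 1):
        partner = reverse_complement(s[i:i + stem])
        if s.find(partner, i + stem + min_loop) != -1:
            return True
    return False


def passes_screen(seq: str, gc_range=(0.4, 0.6)) -> bool:
    """Example filter: GC content within range and no predicted hairpin."""
    low, high = gc_range
    return low <= gc_content(seq) <= high and not has_hairpin(seq)


if __name__ == "__main__":
    for candidate in ("GGGCAAAAGCCC", "ACGTTGCAAGTC"):
        print(candidate, gc_content(candidate), wallace_tm(candidate),
              has_hairpin(candidate), passes_screen(candidate))
```

A practical design pipeline would likely replace these heuristics with nearest-neighbor thermodynamic models and dedicated structure prediction software; the functions here only show where such checks would sit inside a search loop.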

Likewise, the formation of homodimers (oligos that hybridize with oligos of the same sequence) and unwanted heterodimers (oligos that hybridize with other oligos aside from their intended assembly partner) may interfere with ligation. Similar to secondary structures within a component, the formation of homodimers and heterodimers may be predicted and accounted for during oligo design using computational methods and design space search algorithms.

Longer oligo sequences or higher GC content may create increased formation of unwanted secondary structures, homodimers, and heterodimers within the ligation reaction. Therefore, in some embodiments, the use of shorter oligos or lower GC content may lead to higher assembly efficiency. These design principles may counteract the design strategies of using long oligos or high GC content for more efficient assembly. As such, there may be an optimal length and optimal GC content for the oligos that make up each component such that the ligation assembly efficiency is optimized. The overall length of oligos to be used in ligation may be at least 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100 bases, or above. The overall GC content of oligos to be used in ligation may be anywhere between 0% and 100%.

In addition to sticky end ligation, ligation may also occur between single-stranded nucleic acids using staple (or template or bridge) strands. This method may be referred to as staple strand ligation (SSL), template directed ligation (TDL), or bridge strand ligation. See FIG. 10A for an example schematic of TDL for assembling three nucleic acids. In TDL, two single stranded nucleic acids hybridize adjacently onto a template, thus forming a nick that may be sealed by a ligase. The same nucleic acid design considerations for sticky end ligation also apply to TDL. Stronger hybridization between the templates and their intended complementary nucleic acid sequences may lead to increased ligation efficiency. Therefore sequence features that improve the hybridization stability (or melting temperature) on each side of the template may improve ligation efficiency. These features may include longer sequence length and higher GC content. The length of nucleic acids in TDL, including templates, may be at least 5, 10, 20, 30, 50, 60, 70, 80, 90, or 100 bases, or above. The GC content of nucleic acids, including templates, may be anywhere between 0% and 100%.
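The geometry of a template (staple) strand can be sketched as follows. Assuming two single stranded components written 5′ to 3′ that are to be joined in the order upstream then downstream, a template that bridges the junction is the reverse complement of the last k bases of the upstream component followed by the first k bases of the downstream component. The function name, the arm length k, and the example sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(COMPLEMENT)[::-1]


def design_template(upstream: str, downstream: str, arm: int) -> str:
    """Return a template (5'-to-3') spanning the junction of two oligos.

    The template pairs with `arm` bases on each side of the nick.
    """
    if arm > len(upstream) or arm > len(downstream):
        raise ValueError("arm is longer than one of the components")
    junction = upstream[-arm:] + downstream[:arm]
    return reverse_complement(junction)


oligo_a = "ATGCATGCAA"  # hypothetical upstream component
oligo_b = "GGTTCCAACG"  # hypothetical downstream component

template = design_template(oligo_a, oligo_b, arm=4)
print(template)
```

Lengthening `arm` corresponds to the longer hybridization regions discussed above, which may raise the melting temperature on each side of the template at the cost of a longer template strand.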

In TDL, as with sticky end ligation, care may be taken to design component and template sequences that avoid unwanted secondary structures by using nucleic acid structure-predicting software with sequence space search algorithms. As the components in TDL may be single stranded instead of double stranded, there may be higher incidence of unwanted secondary structures (as compared to sticky end ligation) due to the exposed bases.

TDL may also be performed with blunt-ended dsDNA components. In such reactions, in order for the staple strand to properly bridge two single-stranded nucleic acids, the staple may first need to displace or partially displace the full single-stranded complements. To facilitate the TDL reaction with dsDNA components, the dsDNA may initially be melted with incubation at a high temperature. The reaction may then be cooled thus allowing staple strands to anneal to their proper nucleic acid complements. This process may be made even more efficient by using a relatively high concentration of template compared to dsDNA components, thus enabling the templates to outcompete the proper full-length ssDNA complements for binding. Once two ssDNA strands get assembled by their template and a ligase, that assembled nucleic acid may then become a template for the opposite full-length ssDNA complements. Therefore, ligation of blunt-ended dsDNA with TDL may be improved through multiple rounds of melting (incubation at higher temperatures) and annealing (incubation at lower temperatures). This process may be referred to as Ligase Cycling Reaction, or LCR. Proper melting and annealing temperatures depend on the nucleic acid sequences. Melting and annealing temperatures may be at least 4, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100 degrees Celsius. The number of temperature cycles may be at least 1, 5, 10, 15, 20, 25, 30, or more.

All ligations may be performed in fixed temperature reactions or in multi-temperature reactions. Ligation temperatures may be at least 0, 4, 10, 20, 30, 40, 50, or 60 degrees Celsius or above. The optimal temperature for ligase activity may differ depending on the type of ligase. Moreover, the rate at which components adjoin or hybridize in the reaction may differ depending on their nucleic acid sequences. Higher incubation temperatures may promote faster diffusion and therefore increase the frequency with which components temporarily adjoin or hybridize. However, increased temperature may also disrupt base-pair bonds and therefore decrease the stability of those adjoined or hybridized component duplexes. The optimal temperature for ligation may depend on the number of nucleic acids to be assembled, the sequences of those nucleic acids, the type of ligase, as well as other factors such as reaction additives. For example, two sticky end components with 4-base complementary overhangs may assemble faster at 4 degrees Celsius with T4 ligase than at 25 degrees Celsius with T4 ligase. But two sticky-end components with 25-base complementary overhangs may assemble faster at 25 degrees Celsius with T4 ligase than at 4 degrees Celsius with T4 ligase, and perhaps faster than ligation with 4-base overhangs at any temperature. In some embodiments of ligation, it may be beneficial to heat and slowly cool the components for annealing prior to ligase addition.

Ligation may be used to assemble at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more nucleic acids. Ligation incubation times may be at most 30 seconds, 1 minute, 2 minutes, 5 minutes, 10 minutes, 20 minutes, 30 minutes, 1 hour, or longer. Longer incubation times may improve ligation efficiency.

Ligation may require nucleic acids with 5′ phosphorylated ends. Nucleic acid components without 5′ phosphorylated ends may be phosphorylated in a reaction with polynucleotide kinase, such as T4 polynucleotide kinase (or T4 PNK). Other co-factors may be present in the reaction such as ATP, magnesium ion, or DTT. Polynucleotide kinase reactions may occur at 37 degrees Celsius for 30 minutes. Polynucleotide kinase reaction temperatures may be at least 4, 10, 20, 30, 40, 50, or 60 degrees Celsius. Polynucleotide kinase reaction incubation times may be at most 1 minute, 5 minutes, 10 minutes, 20 minutes, 30 minutes, 60 minutes, or more. Alternatively, the nucleic acid components may be synthetically (as opposed to enzymatically) designed and manufactured with a modified 5′ phosphorylation. Only nucleic acids being assembled on their 5′ ends may require phosphorylation. For example, templates in TDL may not be phosphorylated as they are not intended to be assembled.

Additives may be included in a ligation reaction to improve ligation efficiency. Examples of additives include dimethyl sulfoxide (DMSO), polyethylene glycol (PEG), 1,2-propanediol (1,2-Prd), glycerol, Tween-20, or combinations thereof. PEG6000 may be a particularly effective ligation enhancer. PEG6000 may increase ligation efficiency by acting as a crowding agent. For example, the PEG6000 may form aggregated nodules that take up space in the ligase reaction solution and bring the ligase and components into closer proximity. Additive content (weight per volume) may be at least 0%, 1%, 5%, 10%, 20%, or more.

Various ligases may be used for ligation. The ligases can be naturally occurring or synthesized. Examples of ligases include T4 DNA Ligase, T7 DNA Ligase, T3 DNA Ligase, Taq DNA Ligase, 9°N™ DNA Ligase, E. coli DNA Ligase, and SplintR DNA Ligase. Different ligases may be stable and function optimally at different temperatures. For example, Taq DNA Ligase is thermostable and T4 DNA Ligase is not. Moreover, different ligases have different properties. For example, T4 DNA Ligase may ligate blunt-ended dsDNA while T7 DNA Ligase may not.

Ligation may be used to attach sequencing adapters to a library of nucleic acids. For example, the ligation may be performed with common sticky ends or staples at the ends of each member of the nucleic acid library. If the sticky end or staple at one end of the nucleic acids is distinct from that of the other end, then the sequencing adapters may be ligated asymmetrically. For example, a forward sequencing adapter may be ligated to one end of the members of the nucleic acid library and a reverse sequencing adapter may be ligated to the other end of the members of the nucleic acid library. Alternatively, blunt-ended ligation may be used to attach adapters to a library of blunt-ended double-stranded nucleic acids. Fork adapters may be used to asymmetrically attach adapters to a nucleic acid library with either blunt ends or sticky ends that are equivalent at each end (such as A-tails).

Ligation may be inhibited by heat inactivation (for example incubation at 65 degrees Celsius for at least 20 minutes), addition of a denaturant, or addition of a chelator such as EDTA.

C. Restriction Digest

Restriction digests are reactions in which restriction endonucleases (or restriction enzymes) recognize their cognate restriction site on nucleic acids and subsequently cleave (or digest) the nucleic acids containing said restriction site. Type I, type II, type III, or type IV restriction enzymes may be used for restriction digests. Type II restriction enzymes may be the most efficient restriction enzymes for nucleic acid digestions. Type II restriction enzymes may recognize palindromic restriction sites and cleave nucleic acids within the recognition site. Examples of said restriction enzymes (and their restriction sites) include AatII (GACGTC), AfeI (AGCGCT), ApaI (GGGCCC), DpnI (GATC), EcoRI (GAATTC), NheI (GCTAGC), and many more. Some restriction enzymes, such as DpnI and AfeI, may cut their restriction sites in the center, thus leaving blunt-ended dsDNA products. Other restriction enzymes, such as EcoRI and AatII, cut their restriction sites off-center, thus leaving dsDNA products with sticky ends (or staggered ends). Some restriction enzymes may target discontinuous restriction sites. For example, the restriction enzyme AlwNI recognizes the restriction site CAGNNNCTG, where N may be either A, T, C, or G. Restriction sites may be at least 2, 4, 6, 8, 10, or more bases long.

Some Type II restriction enzymes cleave nucleic acids outside of their restriction sites. The enzymes may be sub-classified as either Type IIS or Type IIG restriction enzymes. Said enzymes may recognize restriction sites that are non-palindromic. Examples of said restriction enzymes include BbsI, that recognizes GAAGAC and creates a staggered cleavage 2 (same strand) and 6 (opposite strand) bases further downstream. Another example includes BsaI, that recognizes GGTCTC and creates a staggered cleavage 1 (same strand) and 5 (opposite strand) bases further downstream. Said restriction enzymes may be used for Golden Gate assembly or modular cloning (MoClo). Some restriction enzymes, such as BcgI (a Type IIG restriction enzyme) may create a staggered cleavage on both ends of its recognition site. Restriction enzymes may cleave nucleic acids at least 1, 5, 10, 15, 20, or more bases away from their recognition sites. Because said restriction enzymes may create staggered cleavages outside of their recognition sites, the sequences of the resulting nucleic acid overhangs may be arbitrarily designed. This is as opposed to restriction enzymes that create staggered cleavages within their recognition sites, where the sequence of a resulting nucleic acid overhang is coupled to the sequence of the restriction site. Nucleic acid overhangs created by restriction digests may be at least 1, 2, 3, 4, 5, 6, 7, 8, or more bases long. When restriction enzymes cleave nucleic acids, the resulting 5′ ends contain a phosphate.
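The way a Type IIS cut position determines a designable overhang can be shown with a small sketch. It uses the BsaI geometry stated above (recognition site GGTCTC, cuts 1 base downstream on the same strand and 5 bases downstream on the opposite strand), models only a site in the forward orientation on the top strand, and uses a hypothetical example sequence and function name.

```python
def type_iis_digest(seq: str, site: str = "GGTCTC",
                    top_cut: int = 1, bottom_cut: int = 5):
    """Return (upstream, downstream, overhang) for a forward-oriented site.

    `top_cut` and `bottom_cut` count bases past the end of the site,
    measured along the top strand.
    """
    end = seq.index(site) + len(site)
    upstream = seq[: end + top_cut]      # still carries the recognition site
    downstream = seq[end + top_cut:]     # site-free fragment
    overhang = seq[end + top_cut: end + bottom_cut]  # 5' overhang, top strand
    return upstream, downstream, overhang


# Hypothetical component: flank + BsaI site + 1 spacer base + chosen overhang.
component = "AA" + "GGTCTC" + "A" + "TGCC" + "TTTT"

upstream, downstream, overhang = type_iis_digest(component)
print(upstream)
print(downstream)
print(overhang)
```

Because the four overhang bases lie outside the recognition site, they can be chosen freely by the designer, and the downstream fragment no longer contains the BsaI site.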

One or more nucleic acid sequences may be included in a restriction digest reaction. Likewise, one or more restriction enzymes may be used together in a restriction digest reaction. Restriction digests may contain additives and cofactors including potassium ion, magnesium ion, sodium ion, BSA, S-Adenosyl-L-methionine (SAM), or combinations thereof. Restriction digest reactions may be incubated at 37 degrees Celsius for one hour. Restriction digest reactions may be incubated in temperatures of at least 0, 10, 20, 30, 40, 50, or 60 degrees Celsius. Optimal digest temperatures may depend on the enzymes. Restriction digest reactions may be incubated for at most 1, 10, 30, 60, 90, 120, or more minutes. Longer incubation times may result in increased digestion.

D. Nucleic Acid Amplification

Nucleic acid amplification may be executed with polymerase chain reaction, or PCR. In PCR, a starting pool of nucleic acids (referred to as the template pool or template) may be combined with polymerase, primers (short nucleic acid probes), nucleotide triphosphates (such as dATP, dTTP, dCTP, dGTP, and analogs or variants thereof), and additional cofactors and additives such as betaine, DMSO, and magnesium ion. The template may be single stranded or double stranded nucleic acids. The primer may be a short nucleic acid sequence built synthetically to complement and hybridize to a target sequence in the template pool. Typically, there are two primers in a PCR reaction, one to complement a primer binding site on the top strand of a target template, and another to complement a primer binding site on the bottom strand of the target template downstream of the first binding site. The 5′-to-3′ orientation in which these primers bind their target must be facing each other in order to successfully replicate and exponentially amplify the nucleic acid sequence in between them. Though “PCR” may typically refer to reactions specifically of said form, it may also be used more generally to refer to any nucleic acid amplification reaction.

In some embodiments, PCR may comprise cycling between three temperatures: a melting temperature, an annealing temperature, and an extension temperature. The melting temperature is intended to turn double stranded nucleic acids into single stranded nucleic acids, as well as remove hybridization products and secondary structures. Typically the melting temperature is high, for example above 95 degrees Celsius. In some embodiments the melting temperature may be at least 96, 97, 98, 99, 100, 101, 102, 103, 104, or 105 degrees Celsius. In other embodiments the melting temperature may be at most 95, 94, 93, 92, 91, or 90 degrees Celsius. A higher melting temperature will improve dissociation of nucleic acids and their secondary structures, but may also cause side effects such as the degradation of nucleic acids or the polymerase. Melting temperatures may be applied to the reaction for at least 1, 2, 3, 4, 5 seconds, or above, such as 30 seconds, 1 minute, 2 minutes, or 3 minutes. A longer initial melting temperature step may be recommended for PCR with complex or long template.

The annealing temperature is intended to facilitate the formation of hybridization between the primers and their target templates. In some embodiments, the annealing temperature may match the calculated melting temperature of the primer. In other embodiments, the annealing temperature may be within 10 degrees Celsius or more of said melting temperature. In some embodiments, the annealing temperature may be at least 25, 30, 50, 55, 60, 65, or 70 degrees Celsius. The melting temperature may depend on the sequence of the primer. Longer primers may have higher melting temperatures, and primers with higher percent content of guanine or cytosine nucleotides may have higher melting temperatures. It may therefore be possible to design primers intended to anneal optimally at particular annealing temperatures. Annealing temperatures may be applied to the reaction for at least 1, 5, 10, 15, 20, 25, or 30 seconds, or above. To help ensure annealing, the primer concentrations may be at high or saturating amounts. Primer concentrations may be 500 nanomolar (nM). Primer concentrations may be at most 1 nM, 10 nM, 100 nM, 1000 nM, or more.

The extension temperature is intended to initiate and facilitate the 3′ end nucleic acid chain elongation of primers catalyzed by one or more polymerase enzymes. In some embodiments, the extension temperature may be set at the temperature in which the polymerase functions optimally in terms of nucleic acid binding strength, elongation speed, elongation stability, or fidelity. In some embodiments, the extension temperature may be at least 30, 40, 50, 60, or 70 degrees Celsius, or above. Extension temperatures may be applied to the reaction for at least 1, 5, 10, 15, 20, 25, 30, 40, 50, or 60 seconds or above. Recommended extension times may be approximately 15 to 45 seconds per kilobase of expected elongation.

In some embodiments of PCR, the annealing temperature and the extension temperature may be the same. Thus a 2-step temperature cycle may be used instead of a 3-step temperature cycle. Examples of combined annealing and extension temperatures include 60, 65, or 72 degrees Celsius.

In some embodiments, PCR may be performed with one temperature cycle. Such embodiments may involve turning targeted single stranded template nucleic acid into double stranded nucleic acid. In other embodiments, PCR may be performed with multiple temperature cycles. If the PCR is efficient, it is expected that the number of target nucleic acid molecules will double each cycle, thereby creating an exponential increase in the number of targeted nucleic acid templates from the original template pool. The efficiency of PCR may vary. Therefore, the actual percent of targeted nucleic acid that is replicated each round may be more or less than 100%. Each PCR cycle may introduce undesirable artifacts such as mutated and recombined nucleic acids. To curtail this potential detriment, a polymerase with high fidelity and high processivity may be used. In addition, a limited number of PCR cycles may be used. PCR may involve at most 1, 5, 10, 15, 20, 25, 30, 35, 40, 45, or more cycles.
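The doubling behavior described above can be written as a simple model. The sketch below assumes a constant per-cycle efficiency E, where E = 1.0 corresponds to every target molecule being replicated in each cycle, so that the expected copy number after n cycles is N0 × (1 + E)^n. The constant-efficiency assumption and the function names are illustrative; real reactions typically plateau as reagents are consumed.

```python
import math


def expected_copies(n0: float, cycles: int, efficiency: float = 1.0) -> float:
    """Expected target copies after `cycles` PCR cycles.

    `efficiency` is the fraction of targets replicated per cycle.
    """
    return n0 * (1.0 + efficiency) ** cycles


def cycles_needed(n0: float, target: float, efficiency: float = 1.0) -> int:
    """Smallest whole number of cycles expected to reach `target` copies."""
    return math.ceil(math.log(target / n0) / math.log(1.0 + efficiency))


if __name__ == "__main__":
    print(expected_copies(100, 10))          # ideal doubling
    print(expected_copies(100, 10, 0.9))     # 90% per-cycle efficiency
    print(cycles_needed(100, 1_000_000))     # cycles to reach one million
```

Estimating the fewest cycles that reach a required yield is one way to apply the guidance above to limit cycle number and thereby limit accumulated artifacts.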

In some embodiments, multiple distinct target nucleic acid sequences may be amplified together in one PCR. If each target sequence has common primer binding sites, then all nucleic acid sequences may be amplified with the same set of primers. Alternatively, PCR may comprise multiple primers, each intended to target distinct nucleic acids. Said PCR may be referred to as multiplex PCR. PCR may involve at most 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more distinct primers. In PCR with multiple distinct nucleic acid targets, each PCR cycle may change the relative distribution of the targeted nucleic acids. For example, a uniform distribution may become skewed or non-uniformly distributed. To curtail this potential detriment, optimal polymerases (e.g., with high fidelity and sequence robustness) and optimal PCR conditions may be used. Factors such as annealing and extension temperature and time may be optimized. In addition, a limited number of PCR cycles may be used.

In some embodiments of PCR, a primer with base mismatches to its targeted primer binding site in the template may be used to mutate the target sequence. In some embodiments of PCR, a primer with an extra sequence on its 5′ end (known as an overhang) may be used to attach a sequence to its targeted nucleic acid. For example, primers containing sequencing adapters on their 5′ ends may be used to prepare and/or amplify a nucleic acid library for sequencing. Primers that target sequencing adapters may be used to amplify nucleic acid libraries to sufficient enrichment for certain sequencing technologies.

In some embodiments, linear-PCR (or asymmetric-PCR) is used wherein primers only target one strand (not both strands) of a template. In linear-PCR the replicated nucleic acid from each cycle is not complementary to the primers, so the primers do not bind it. Therefore, the primers only replicate the original target template with each cycle, hence the linear (as opposed to exponential) amplification. Though the amplification from linear-PCR may not be as fast as conventional (exponential) PCR, the maximal yield may be greater. Theoretically, the primer concentration in linear-PCR may not become a limiting factor with increased cycles and increased yield as it would with conventional PCR. Linear-After-The-Exponential-PCR (or LATE-PCR) is a modified version of linear-PCR that may be capable of particularly high yields.
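The linear character of this amplification may be sketched as follows: because only the original templates are copied, each cycle adds at most one new strand per template. The function name `linear_pcr_new_strands` is a hypothetical helper, and the model ignores reagent depletion.

```python
def linear_pcr_new_strands(initial_templates: float, cycles: int,
                           efficiency: float = 1.0) -> float:
    """Idealized count of newly synthesized strands after linear-PCR.

    Each cycle copies only the original templates, so the product grows
    in proportion to the number of cycles rather than exponentially.
    """
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    return initial_templates * cycles * efficiency


# 100 templates and 30 cycles give about 3000 new single strands,
# far fewer than the idealized 100 * 2**30 of exponential PCR.
print(linear_pcr_new_strands(100, 30))
print(100 * 2 ** 30)
```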

In some embodiments of nucleic acid amplification, the process of melting, annealing, and extension may occur at a single temperature. Such PCR may be referred to as isothermal PCR. Isothermal PCR may leverage temperature-independent methods for dissociating or displacing the fully-complemented strands of nucleic acids from each other in favor of primer binding. Strategies include loop-mediated isothermal amplification, strand displacement amplification, helicase-dependent amplification, and nicking enzyme amplification reaction. Isothermal nucleic acid amplification may occur at temperatures of at most 20, 30, 40, 50, 60, or 70 degrees Celsius or more.

In some embodiments, PCR may further comprise a fluorescent probe or dye to quantify the amount of nucleic acid in a sample. For example, the dye may intercalate into double stranded nucleic acids. An example of said dye is SYBR Green. A fluorescent probe may also be a nucleic acid sequence attached to a fluorescent unit. The fluorescent unit may be released upon hybridization of the probe to a target nucleic acid and subsequent modification from an extending polymerase unit. Examples of said probes include TaqMan probes. Such probes may be used in conjunction with PCR and optical measurement tools (for excitation and detection) to quantify nucleic acid concentration in a sample. This process may be referred to as quantitative PCR (qPCR) or real-time PCR (rtPCR).

In some embodiments, a PCR may be performed on a single-molecule template (in a process that may be referred to as single-molecule PCR), rather than on a pool of multiple template molecules. For example, emulsion-PCR (ePCR) may be used to encapsulate single nucleic acid molecules within water droplets within an oil emulsion. The water droplets may also contain PCR reagents, and the water droplets may be held in a temperature-controlled environment capable of requisite temperature cycling for PCR. This way, multiple self-contained PCR reactions may occur simultaneously in high throughput. The stability of oil emulsions may be improved with surfactants. The movement of droplets may be controlled with pressure through microfluidic channels. Microfluidic devices may be used to create droplets, split droplets, merge droplets, inject material into droplets, and to incubate droplets. The size of water droplets in oil emulsions may be at least 1 picoliter (pL), 10 pL, 100 pL, 1 nanoliter (nL), 10 nL, 100 nL, or more.

In some embodiments, single-molecule PCR may be performed on a solid-phase substrate. Examples include the Illumina solid-phase amplification method or variants thereof. The template pool may be exposed to a solid-phase substrate, wherein the solid-phase substrate may immobilize templates at a certain spatial resolution. Bridge amplification may then occur within the spatial neighborhood of each template, thereby amplifying single molecules in a high throughput fashion on the substrate.

High-throughput, single-molecule PCR may be useful for amplifying a pool of distinct nucleic acids that may interfere with each other. For example, if multiple distinct nucleic acids share a common sequence region, then recombination between the nucleic acids along this common region may occur during the PCR reaction, resulting in new, recombined nucleic acids. Single-molecule PCR would prevent this potential amplification error as it compartmentalizes distinct nucleic acid sequences from each other so they may not interact. Single-molecule PCR may be particularly useful for preparing nucleic acids for sequencing. Single-molecule PCR may also be useful for absolute quantitation of a number of targets within a template pool. For example, digital PCR (or dPCR) uses the frequency of distinct single-molecule PCR amplification signals to estimate the number of starting nucleic acid molecules in a sample.
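One common way to turn the frequency of positive partitions into a molecule count is a Poisson correction, which accounts for partitions that received more than one template. The correction is a standard assumption in digital PCR analysis rather than a detail of this disclosure, and the function name `dpcr_estimate` is hypothetical.

```python
import math


def dpcr_estimate(positive_partitions: int, total_partitions: int) -> float:
    """Estimate the number of starting molecules across all partitions.

    Assumption: templates are distributed among partitions according to a
    Poisson distribution, so the mean occupancy per partition is
    -ln(1 - fraction_positive).
    """
    if total_partitions <= 0:
        raise ValueError("total_partitions must be positive")
    if not 0 <= positive_partitions < total_partitions:
        raise ValueError("need 0 <= positive_partitions < total_partitions")
    fraction_positive = positive_partitions / total_partitions
    mean_per_partition = -math.log1p(-fraction_positive)
    return mean_per_partition * total_partitions


# With half of 20000 droplets positive, the estimate exceeds 10000
# because some positive droplets held more than one template.
print(dpcr_estimate(10000, 20000))
```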

In some embodiments of PCR, a group of nucleic acids may be non-discriminately amplified using primers for primer binding sites common to all nucleic acids, for example, primers for primer binding sites flanking all nucleic acids in a pool. Synthetic nucleic acid libraries may be created or assembled with these common sites for general amplification. However, in some embodiments, PCR may be used to selectively amplify a targeted subset of nucleic acids from a pool, for example, by using primers with primer binding sites that only appear on said targeted subset of nucleic acids. Synthetic nucleic acid libraries may be created or assembled such that nucleic acids belonging to potential sub-libraries of interest all share common primer binding sites on their edges (common within the sub-library but distinct from other sub-libraries) for selective amplification of the sub-library from the more general library. In some embodiments, PCR may be combined with nucleic acid assembly reactions (such as ligation or OEPCR) to selectively amplify fully assembled or potentially fully assembled nucleic acids from partially assembled or mis-assembled (or unintended or undesirable) by-products. For example, the assembly may involve assembling a nucleic acid with a primer binding site on each edge sequence such that only a fully assembled nucleic acid product would contain the requisite two primer binding sites for amplification. In said example, a partially assembled product may contain neither or only one of the edge sequences with the primer binding sites, and therefore should not be amplified. Likewise, a mis-assembled (or unintended or undesirable) product may contain neither or only one of the edge sequences, or both edge sequences but in the incorrect orientation or separated by an incorrect number of bases. Therefore, said mis-assembled product should either not amplify or amplify to create a product of incorrect length. In the latter case, the amplified mis-assembled product of incorrect length may be separated from the amplified fully assembled product of correct length by nucleic acid size selection methods (see Chemical Methods Section E), such as DNA electrophoresis in an agarose gel followed by gel extraction.
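The selection logic described above may be sketched in silico. In this simplified model, a product is amplifiable only if its top strand contains the forward primer sequence followed downstream by the reverse complement of the reverse primer, and size selection then keeps only amplicons of the expected length. All sequences and function names here are hypothetical illustrations, and primer binding is reduced to exact substring matching on one strand.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def amplicon_length(template: str, fwd_primer: str, rev_primer: str):
    """Return the amplicon length, or None if the template lacks either
    primer binding site in the expected orientation."""
    start = template.find(fwd_primer)
    if start < 0:
        return None
    rev_site = reverse_complement(rev_primer)
    end = template.find(rev_site, start + len(fwd_primer))
    if end < 0:
        return None
    return end + len(rev_site) - start


fwd, rev = "ACGTAC", "CCTTGA"  # hypothetical edge primers
pool = {
    "full": "ACGTAC" + "GGGGAAAA" + "TCAAGG",     # both edges, correct length
    "partial": "ACGTAC" + "GGGGAAAA",             # missing one edge
    "misassembled": "ACGTAC" + "GGGG" + "TCAAGG"  # both edges, wrong length
}
expected_length = 20

# PCR step: only products with both sites amplify.
amplified = {name: amplicon_length(seq, fwd, rev) for name, seq in pool.items()}
# Size-selection step: keep only amplicons of the expected length.
selected = [name for name, length in amplified.items() if length == expected_length]
print(amplified)
print(selected)
```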

Additives may be included in the PCR to improve the efficiency of nucleic acid amplification. Examples include Betaine, Dimethyl sulfoxide (DMSO), non-ionic detergents, Formamide, Magnesium, Bovine Serum Albumin (BSA), or combinations thereof. Additive content (weight per volume) may be at least 0%, 1%, 5%, 10%, 20%, or more.

Various polymerases may be used for PCR. The polymerase can be naturally occurring or synthesized. An example polymerase is a Φ29 polymerase or derivative thereof. In some cases, a transcriptase or a ligase is used (i.e., enzymes which catalyze the formation of a bond) in conjunction with polymerases or as an alternative to polymerases to construct new nucleic acid sequences. Examples of polymerases include a DNA polymerase, an RNA polymerase, a thermostable polymerase, a wild-type polymerase, a modified polymerase, E. coli DNA polymerase I, T7 DNA polymerase, bacteriophage T4 DNA polymerase, Φ29 (phi29) DNA polymerase, Taq polymerase, Tth polymerase, Tli polymerase, Pfu polymerase, Pwo polymerase, VENT polymerase, DEEPVENT polymerase, Ex-Taq polymerase, LA-Taq polymerase, Sso polymerase, Poc polymerase, Pab polymerase, Mth polymerase, ES4 polymerase, Tru polymerase, Tac polymerase, Tne polymerase, Tma polymerase, Tca polymerase, Tih polymerase, Tfi polymerase, Platinum Taq polymerases, Tbr polymerase, Phusion polymerase, KAPA polymerase, Q5 polymerase, Tfl polymerase, Pfuturbo polymerase, Pyrobest polymerase, KOD polymerase, Bst polymerase, Sac polymerase, Klenow fragment polymerase with 3′ to 5′ exonuclease activity, and variants, modified products and derivatives thereof. Different polymerases may be stable and function optimally at different temperatures. Moreover, different polymerases have different properties. For example, some polymerases, such as Phusion polymerase, may exhibit 3′ to 5′ exonuclease activity, which may contribute to higher fidelity during nucleic acid elongation. Some polymerases may displace leading sequences during elongation, while others may degrade them or halt elongation. Some polymerases, like Taq, incorporate an adenine base at the 3′ end of nucleic acid sequences. Additionally, some polymerases may have higher fidelity and processivity than others and may be more suitable to PCR applications, such as sequencing preparation, where it is important for the amplified nucleic acid yield to have minimal mutations and where it is important for the distribution of distinct nucleic acids to maintain uniform distribution throughout amplification.

E. Size Selection

Nucleic acids of a particular size may be selected from a sample using size-selection techniques. In some embodiments, size-selection may be performed using gel electrophoresis or chromatography. Liquid samples of nucleic acids may be loaded onto one terminal of a stationary phase or gel (or matrix). A voltage difference may be placed across the gel such that the negative terminal of the gel is the terminal at which the nucleic acid samples are loaded and the positive terminal of the gel is the opposite terminal. Since the nucleic acids have a negatively charged phosphate backbone, they will migrate across the gel to the positive terminal. The size of the nucleic acid will determine its relative speed of migration through the gel. Therefore, nucleic acids of different sizes will resolve on the gel as they migrate. Voltage differences may be 100V or 120V. Voltage differences may be at most 50V, 100V, 150V, 200V, 250V, or more. Larger voltage differences may increase the speed of nucleic acid migration and size resolution. However, larger voltage differences may also damage the nucleic acids or the gel. Larger voltage differences may be recommended for resolving nucleic acids of larger sizes. Typical migration times may be between 15 minutes and 60 minutes. Migration times may be at most 10 minutes, 30 minutes, 60 minutes, 90 minutes, 120 minutes, or more. Longer migration times, similar to higher voltage, may lead to better nucleic acid resolution but may lead to increased nucleic acid damage. Longer migration times may be recommended for resolving nucleic acids of larger sizes. For example, a voltage difference of 120V and a migration time of 30 minutes may be sufficient for resolving a 200-base nucleic acid from a 250-base nucleic acid.

The properties of the gel, or matrix, may affect the size-selection process. Gels typically comprise a polymer substance, such as agarose or polyacrylamide, dispersed in a conductive buffer such as TAE (Tris-acetate-EDTA) or TBE (Tris-borate-EDTA). The content (weight per volume) of the substance (e.g., agarose or acrylamide) in the gel may be at most 0.5%, 1%, 2%, 3%, 5%, 10%, 15%, 20%, 25%, or higher. Higher content may decrease migration speed. Higher content may be preferable for resolving smaller nucleic acids. Agarose gels may be better for resolving double stranded DNA (dsDNA). Polyacrylamide gels may be better for resolving single stranded DNA (ssDNA). The preferred gel composition may depend on the nucleic acid type and size, the compatibility of additives (e.g., dyes, stains, denaturing solutions, or loading buffers) as well as the anticipated downstream applications (e.g., gel extraction then ligation, PCR, or sequencing). Agarose gels may be simpler for gel extraction than polyacrylamide gels. TAE, though not as good a conductor as TBE, may also be better for gel extraction because borate (an enzyme inhibitor) carry-over in the extraction process may inhibit downstream enzymatic reactions.

Gels may further comprise a denaturing solution such as SDS (sodium dodecyl sulfate) or urea. SDS may be used, for example, to denature proteins or to separate nucleic acids from potentially bound proteins. Urea may be used to denature secondary structures in DNA. For example, urea may convert dsDNA into ssDNA, or urea may convert a folded ssDNA (for example a hairpin) to a non-folded ssDNA. Urea-polyacrylamide gels (further comprising TBE) may be used for accurately resolving ssDNA.

Samples may be incorporated into gels with different formats. In some embodiments, gels may contain wells in which samples may be loaded manually. One gel may have multiple wells for running multiple nucleic acid samples. In other embodiments, the gels may be attached to microfluidic channels that automatically load the nucleic acid sample(s). Each gel may be downstream of several microfluidic channels, or the gels themselves may each occupy separate microfluidic channels. The dimensions of the gel may affect the sensitivity of nucleic acid detection (or visualization). For example, thin gels or gels inside of microfluidic channels (such as in bioanalyzers or tapestations) may improve the sensitivity of nucleic acid detection. The nucleic acid detection step may be important for selecting and extracting a nucleic acid fragment of the correct size.

A ladder may be loaded into a gel for nucleic acid size reference. The ladder may contain markers of different sizes to which the nucleic acid sample may be compared. Different ladders may have different size ranges and resolutions. For example, a 50 base ladder may have markers at 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, and 600 bases. Said ladder may be useful for detecting and selecting nucleic acids within the size range of 50 to 600 bases. The ladder may also be used as a standard for estimating the concentration of nucleic acids of different sizes in a sample.
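As a small illustration of using such a ladder for size reference, the sketch below generates the marker sizes of the 50 base ladder described above and reports which two markers bracket a fragment of a given size. The function names are hypothetical.

```python
def ladder_markers(step: int = 50, largest: int = 600) -> list:
    """Marker sizes (in bases) of an evenly spaced ladder."""
    return list(range(step, largest + 1, step))


def bracketing_markers(fragment_size: int, markers: list) -> tuple:
    """Return the nearest markers at or below and at or above the fragment.

    Either element is None when the fragment lies outside the ladder range.
    """
    lower = max((m for m in markers if m <= fragment_size), default=None)
    upper = min((m for m in markers if m >= fragment_size), default=None)
    return (lower, upper)


markers = ladder_markers()
# A 220-base fragment would be expected between the 200 and 250 markers.
print(bracketing_markers(220, markers))
```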

Nucleic acid samples and ladders may be mixed with loading buffer to facilitate the gel electrophoresis (or chromatography) process. Loading buffer may contain dyes and markers to help track the migration of the nucleic acids. Loading buffer may further comprise reagents (such as glycerol) that are denser than the running buffer (e.g., TAE or TBE), to ensure that nucleic acid samples sink to the bottom of the sample loading wells (which may be submerged in the running buffer). Loading buffer may further comprise denaturing agents such as SDS or urea. Loading buffer may further comprise reagents for improving the stability of nucleic acids. For example, loading buffer may contain EDTA to protect nucleic acids from nucleases.

In some embodiments, the gel may comprise a stain that binds the nucleic acid and that may be used to optically detect nucleic acids of different sizes. Stains may be specific for dsDNA, ssDNA, or both. Different stains may be compatible with different gel substances. Some stains may require excitation from a source light (or electromagnetic wave) in order to visualize. The source light may be UV (ultraviolet) or blue light. In some embodiments, stains may be added to the gel prior to electrophoresis. In other embodiments, stains may be added to the gel after electrophoresis. Examples of stains include Ethidium Bromide (EtBr), SYBR Safe, SYBR Gold, silver stain, or methylene blue. A reliable method for visualizing dsDNA of a certain size, for example, may be to use an agarose TAE gel with a SYBR Safe or EtBr stain. A reliable method for visualizing ssDNA of a certain size, for example, may be to use a urea-polyacrylamide TBE gel with a methylene blue or silver stain.

In some embodiments, the migration of nucleic acids through gels may be driven by methods other than electrophoresis. For example, gravity, centrifugation, vacuums, or pressure may be used to drive nucleic acids through gels so that they may resolve according to their size.

Nucleic acids of a certain size may be extracted from gels using a blade or razor to excise the band of gel containing the nucleic acid. Proper optical detection techniques and DNA ladders may be used to ensure that the excision occurs precisely at a certain band and that the excision successfully excludes nucleic acids that may belong to different, undesirable size bands. The gel band may be incubated with buffer to dissolve it, thus releasing the nucleic acids into the buffer solution. Heat or physical agitation may speed the dissolution. Alternatively, the gel band may be incubated in buffer long enough to allow diffusion of the DNA into the buffer solution without requiring gel dissolution. The buffer may then be separated from the remaining solid-phase gel, for example by aspiration or centrifugation. The nucleic acids may then be purified from the solution using standard purification or buffer-exchange techniques, such as phenol-chloroform extraction, ethanol precipitation, magnetic bead capture, and/or silica membrane adsorption, washing, and elution. Nucleic acids may also be concentrated in this step.

As an alternative to gel excision, nucleic acids of a certain size may be separated from a gel by allowing them to run off the gel. Migrating nucleic acids may pass through a basin (or well) either embedded in the gel or at the end of the gel. The migration process may be timed or optically monitored such that when the nucleic acid group of a certain size enters the basin, the sample is collected from the basin. The collection may occur, for example, by aspiration. The nucleic acids may then be purified from the collected solution using standard purification or buffer-exchange techniques, such as phenol-chloroform extraction, ethanol precipitation, magnetic bead capture, and/or silica membrane adsorption, washing, and elution. Nucleic acids may also be concentrated in this step.

Other methods for nucleic acid size selection may include mass-spectrometry or membrane-based filtration. In some embodiments of membrane-based filtration, nucleic acids are passed through a membrane (for example a silica membrane) that may preferentially bind to either dsDNA, ssDNA, or both. The membrane may be designed to preferentially capture nucleic acids of at least a certain size. For example, membranes may be designed to filter out nucleic acids of less than 20, 30, 40, 50, 70, 90, or more bases. Said membrane-based, size-selection techniques may not be as stringent as gel electrophoresis or chromatography.

F. Nucleic Acid Capture

Affinity-tagged nucleic acids may be used as sequence specific probes for nucleic acid capture. The probe may be designed to complement a target sequence within a pool of nucleic acids. Subsequently, the probe may be incubated with the nucleic acid pool and hybridized to its target. The incubation temperature may be below the melting temperature of the probe to facilitate hybridization. The incubation temperature may be up to 5, 10, 15, 20, 25, or more degrees Celsius below the melting temperature of the probe. The hybridized target may be captured to a solid-phase substrate that specifically binds the affinity tag. The solid-phase substrate may be a membrane, a well, a column, or a bead. Multiple rounds of washing may remove all non-hybridized nucleic acids from the targets. The washing may occur at a temperature below the melting temperature of the probe to facilitate stable immobilization of target sequences during the wash. The washing temperature may be up to 5, 10, 15, 20, 25, or more degrees Celsius below the melting temperature of the probe. A final elution step may recover the nucleic acid targets from the solid-phase substrate, as well as from the affinity tagged probes. The elution step may occur at a temperature above the melting temperature of the probe to facilitate the release of nucleic acid targets into an elution buffer. The elution temperature may be up to 5, 10, 15, 20, 25, or more degrees Celsius above the melting temperature of the probe.

In some embodiments, biotin may be used as an affinity tag that is immobilized by streptavidin on a solid-phase substrate. Biotinylated oligos, for use as nucleic acid capture probes, may be designed and manufactured. Oligos may be biotinylated on the 5′ or 3′ end. They may also be biotinylated internally on thymine residues. Increased biotin on an oligo may lead to stronger capture on the streptavidin substrate. A biotin on the 3′ end of an oligo may block the oligo from extending during PCR. The biotin tag may be a variant of standard biotin. For example, the biotin variant may be biotin-TEG (triethylene glycol), dual biotin, PC biotin, DesthioBiotin-TEG, or biotin azide. Dual biotin may increase the biotin-streptavidin affinity. Biotin-TEG attaches the biotin group onto a nucleic acid separated by a TEG linker. This may prevent the biotin from interfering with the function of the nucleic acid probe, for example its hybridization to the target. A nucleic acid biotin linker may also be attached to the probe. The nucleic acid linker may comprise nucleic acid sequences that are not intended to hybridize to the target.

The biotinylated nucleic acid probe may be designed with consideration for how well it may hybridize to its target. Nucleic acid probes with higher designed melting temperatures may hybridize to their targets more strongly. Longer nucleic acid probes, as well as probes with higher GC content, may hybridize more strongly due to increased melting temperatures. Nucleic acid probes may have a length of at least 5, 10, 15, 20, 30, 40, 50, or 100 bases, or more. Nucleic acid probes may have a GC content anywhere between 0 and 100%. Care may be taken to ensure that the melting temperature of the probe does not exceed the temperature tolerance of the streptavidin substrate. Nucleic acid probes may be designed to avoid inhibitory secondary structures such as hairpins, homodimers, and heterodimers with off-target nucleic acids. There may be a tradeoff between probe melting temperature and off-target binding. There may be an optimal probe length and GC content at which melting temperature is high and off-target binding is low. A synthetic nucleic acid library may be designed such that its nucleic acids comprise efficient probe binding sites.

The solid-phase streptavidin substrate may be magnetic beads. Magnetic beads may be immobilized using a magnetic strip or plate. The magnetic strip or plate may be brought into contact with a container to immobilize the magnetic beads to the container. Conversely, the magnetic strip or plate may be removed from a container to release the magnetic beads from the container wall into a solution. Different bead properties may affect their application. Beads may have varying sizes. For example, beads may be anywhere between 1 and 3 micrometers (μm) in diameter. Beads may have a diameter of at most 1, 2, 3, 4, 5, 10, 15, 20, or more micrometers. Bead surfaces may be hydrophobic or hydrophilic. Beads may be coated with blocking proteins, for example BSA. Prior to use, beads may be washed or pre-treated with additives, such as blocking solution, to prevent them from non-specifically binding nucleic acids.

A biotinylated probe may be coupled to the magnetic streptavidin beads prior to incubation with the nucleic acid sample pool. This process may be referred to as direct capture. Alternatively, the biotinylated probe may be incubated with the nucleic acid sample pool prior to the addition of magnetic streptavidin beads. This process may be referred to as indirect capture. The indirect capture method may improve target yield. Shorter nucleic acid probes may require a shorter amount of time to couple to the magnetic beads.

Optimal incubation of the nucleic acid probe with the nucleic acid sample may occur at a temperature that is 1 to 10 degrees Celsius or more below the melting temperature of the probe. Incubation temperatures may be at most 5, 10, 20, 30, 40, 50, 60, 70, 80, or more degrees Celsius. The recommended incubation time may be 1 hour. The incubation time may be at most 1, 5, 10, 20, 30, 60, 90, 120, or more minutes. Longer incubation times may lead to better capture efficiency. An additional 10 minutes of incubation may occur after the addition of the streptavidin beads to allow biotin-streptavidin coupling. This additional time may be at most 1, 5, 10, 20, 30, 60, 90, 120, or more minutes. Incubation may occur in buffered solution with additives such as sodium ion.

Hybridization of the probe to its target may be improved if the nucleic acid pool is single-stranded nucleic acid (as opposed to double-stranded). Preparing a ssDNA pool from a dsDNA pool may entail performing linear-PCR with one primer that commonly binds the edge of all nucleic acid sequences in the pool. If the nucleic acid pool is synthetically created or assembled, then this common primer binding site may be included in the synthetic design. The product of the linear-PCR will be ssDNA. More starting ssDNA template for the nucleic acid capture may be generated with more cycles of linear-PCR. See Chemical Methods Section D on PCR.

After the nucleic acid probes are hybridized to their targets and coupled to magnetic streptavidin beads, the beads may be immobilized by a magnet and several rounds of washing may occur. Three to five washes may be sufficient to remove non-target nucleic acids, but more or fewer rounds of washing may be used. Each incremental wash may further decrease non-targeted nucleic acids, but it may also decrease the yield of target nucleic acids. To facilitate proper hybridization of the target nucleic acids to the probe during the wash step, a low incubation temperature may be used. Temperatures as low as 60, 50, 40, 30, 20, 10, or 5 degrees Celsius or less may be used. The washing buffer may comprise Tris buffered solution with sodium ion.

Optimal elution of the hybridized targets from the magnetic bead-coupled probes may occur at a temperature that is equivalent to or more than the melting temperature of the probe. Higher temperatures will facilitate the dissociation of the target from the probe. Elution temperatures may be at most 30, 40, 50, 60, 70, 80, or 90 degrees Celsius, or more. Elution incubation time may be at most 1, 2, 5, 10, 30, 60 or more minutes. Typical incubation times may be approximately 5 minutes, but longer incubation times may improve yield. Elution buffer may be water or Tris-buffered solution with additives such as EDTA.

Nucleic acid capture of target sequences containing at least one of a set of distinct sites may be performed in one reaction with multiple distinct probes for each of those sites. Nucleic acid capture of target sequences containing every member of a set of distinct sites may be performed in a series of capture reactions, one reaction for each distinct site using a probe for that particular site. The target yield after a series of capture reactions may be low, but the captured targets may subsequently be amplified with PCR. If the nucleic acid library is synthetically designed, then the targets may be designed with common primer binding sites for PCR.
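The difference between a single multi-probe capture and a series of single-probe captures may be sketched as a set filter. In this idealized model, a probe captures a molecule whenever the molecule contains the probe's binding site as an exact substring; the sequences and function names are hypothetical, and real capture would be subject to incomplete yield and off-target binding.

```python
def capture(pool: list, probe_site: str) -> list:
    """One capture reaction with a single probe."""
    return [seq for seq in pool if probe_site in seq]


def capture_any(pool: list, probe_sites: list) -> list:
    """One reaction with several probes: keeps molecules with any site."""
    return [seq for seq in pool if any(site in seq for site in probe_sites)]


def capture_all(pool: list, probe_sites: list) -> list:
    """A series of reactions, one per probe: keeps molecules with every site."""
    remaining = list(pool)
    for site in probe_sites:
        remaining = capture(remaining, site)
    return remaining


sites = ["AACCGG", "TTGGCA"]  # hypothetical probe binding sites
pool = [
    "AACCGGTTTT",    # first site only
    "CCCCTTGGCA",    # second site only
    "AACCGGTTGGCA",  # both sites
    "GGGGGGGG",      # neither site
]
print(capture_any(pool, sites))
print(capture_all(pool, sites))
```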

Synthetic nucleic acid libraries may be created or assembled with common probe binding sites for general nucleic acid capture. These common sites may be used to selectively capture fully assembled or potentially fully assembled nucleic acids from assembly reactions, thereby filtering out partially assembled or mis-assembled (or unintended or undesirable) by-products. For example, the assembly may involve assembling a nucleic acid with a probe binding site on each edge sequence such that only a fully assembled nucleic acid product would contain the requisite two probe binding sites necessary to pass through a series of two capture reactions using each probe. In said example, a partially assembled product may contain neither or only one of the probe sites, and therefore should not ultimately be captured. Likewise, a mis-assembled (or unintended or undesirable) product may contain neither or only one of the edge sequences. Therefore, said mis-assembled product may not ultimately be captured. For increased stringency, common probe binding sites may be included on each component of an assembly. A subsequent series of nucleic acid capture reactions using a probe for each component may isolate only fully assembled product (containing each component) from any by-products of the assembly reaction. Subsequent PCR may improve target enrichment, and subsequent size-selection may improve target stringency.

In some embodiments, nucleic acid capture may be used to selectively capture a targeted subset of nucleic acids from a pool, for example, by using probes with binding sites that only appear on said targeted subset of nucleic acids. Synthetic nucleic acid libraries may be created or assembled such that nucleic acids belonging to potential sub-libraries of interest all share common probe binding sites (common within the sub-library but distinct from other sub-libraries) for the selective capture of the sub-library from the more general library.

G. Lyophilization

Lyophilization is a dehydration process. Both nucleic acids and enzymes may be lyophilized. Lyophilized substances may have longer lifetimes. Additives such as chemical stabilizers may be used to maintain functional products (e.g., active enzymes) through the lyophilization process. Disaccharides, such as sucrose and trehalose, may be used as chemical stabilizers.

H. DNA Design

The sequences of nucleic acids (e.g., components) for building synthetic libraries (e.g., identifier libraries) may be designed to avoid synthesis, sequencing, and assembly complications. Moreover, they may be designed to decrease the cost of building the synthetic library and to improve the lifetime over which the synthetic library may be stored.

Nucleic acids may be designed to avoid long strings of homopolymers (or repeated base sequences) that may be difficult to synthesize. Nucleic acids may be designed to avoid homopolymers of length greater than 2, 3, 4, 5, 6, 7 or more. Moreover, nucleic acids may be designed to avoid the formation of secondary structures, such as hairpin loops, that may inhibit their synthesis process. For example, predictive software may be used to generate nucleic acid sequences that do not form stable secondary structures. Nucleic acids for building synthetic libraries may be designed to be short. Longer nucleic acids may be more difficult and expensive to synthesize. Longer nucleic acids may also have a higher chance of mutations during synthesis. Nucleic acids (e.g., components) may be at most 5, 10, 15, 20, 25, 30, 40, 50, 60 or more bases.
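A design screen of the kind described above may be sketched as a simple filter on homopolymer run length and overall length. The thresholds and function names below are hypothetical choices for illustration, and the sketch does not attempt secondary structure prediction.

```python
from itertools import groupby


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated base."""
    return max((len(list(run)) for _, run in groupby(seq)), default=0)


def passes_design_screen(seq: str, max_homopolymer: int = 3,
                         max_length: int = 60) -> bool:
    """True if the sequence is short enough and has no long homopolymer."""
    return len(seq) <= max_length and longest_homopolymer(seq) <= max_homopolymer


candidates = ["ACGTACGTAC", "ACGTTTTTAC", "A" * 70]
# Only the first candidate passes: the second has a run of five T bases
# and the third is both too long and a single homopolymer.
print([seq for seq in candidates if passes_design_screen(seq)])
```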

Nucleic acids to become components in an assembly reaction may be designed to facilitate that assembly reaction. See Appendices A and B for more information on nucleic acid sequence considerations for OEPCR and ligation-based assembly reactions, respectively. Efficient assembly reactions typically involve hybridization between adjacent components. Sequences may be designed to promote these on-target hybridization events while avoiding potential off-target hybridizations. Nucleic acid base modifications, such as locked nucleic acids (LNAs), may be used to strengthen on-target hybridization. These modified nucleic acids may be used, for example, as staples in staple strand ligation or as sticky ends in sticky-strand ligation. Other modified bases that may be used for building synthetic nucleic acid libraries (or identifier libraries) include 2,6-Diaminopurine, 5-Bromo dU, deoxyUridine, inverted dT, inverted diDeoxy-T, Dideoxy-C, 5-Methyl dC, deoxyInosine, Super T, Super G, or 5-Nitroindole. Nucleic acids may contain one or multiple of the same or different modified bases. Some of the said modified bases are natural base analogs (for example, 5-Methyl dC and 2,6-Diaminopurine) that raise duplex melting temperatures and may therefore be useful for facilitating specific hybridization events in assembly reactions. Some of the said modified bases are universal bases (for example, 5-Nitroindole) that can bind to all natural bases and may therefore be useful for facilitating hybridization with nucleic acids that may have variable sequences within desirable binding sites. In addition to their beneficial roles in assembly reactions, these modified bases may be useful in primers (e.g., for PCR) and probes (e.g., for nucleic acid capture) as they may facilitate the specific binding of primers and probes to their target nucleic acids within a pool of nucleic acids. See Chemical Methods Sections D and F for more nucleic acid design considerations with regard to nucleic acid amplification (or PCR) and nucleic acid capture, respectively.

Nucleic acids may be designed to facilitate sequencing. For example, nucleic acids may be designed to avoid typical sequencing complications such as secondary structure, stretches of homopolymers, repetitive sequences, and sequences with too high or too low of a GC content. Certain sequencers or sequencing methods may be error prone. Nucleic acid sequences (or components) that make up synthetic libraries (e.g., identifier libraries) may be designed with certain Hamming distances from each other. This way, even when base resolution errors occur at a high rate in sequencing, the stretches of error-containing sequences may still be mapped back to their most likely nucleic acid (or component). Nucleic acid sequences may be designed with Hamming distances of at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 or more base mutations. Alternative distance metrics from Hamming distance may also be used to define a minimum requisite distance between designed nucleic acids.
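As an illustration of mapping an error-containing read back to its most likely component, the following sketch computes Hamming distances and selects the nearest designed sequence. The three example components and the helper names are hypothetical, and equal-length sequences are assumed; a practical decoder may use a different distance metric or handle insertions and deletions.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(components: list[str]) -> int:
    """Smallest Hamming distance between any two designed components."""
    return min(hamming(a, b) for a, b in combinations(components, 2))


def nearest_component(read: str, components: list[str]) -> str:
    """Map a (possibly erroneous) read to the closest designed component."""
    return min(components, key=lambda c: hamming(read, c))


if __name__ == "__main__":
    library = ["GAGAAC", "TCTATC", "CCATCT"]  # hypothetical components
    print(min_pairwise_distance(library))
    print(nearest_component("GAGATC", library))  # one substitution from GAGAAC
```

With a minimum pairwise distance of d, a read carrying fewer than d/2 substitutions is always closer to its true component than to any other.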

Some sequencing methods and instruments may require input nucleic acids to contain particular sequences, such as adapter sequences or primer-binding sites. These sequences may be referred to as “method-specific sequences”. Typical preparatory workflows for said sequencing instruments and methods may involve assembling the method-specific sequences to the nucleic acid libraries. However, if it is known ahead of time that a synthetic nucleic acid library (e.g., identifier library) will be sequenced with a particular instrument or method, then these method-specific sequences may be designed into the nucleic acids (e.g., components) that comprise the library (e.g., identifier library). For example, sequencing adapters may be assembled onto the members of a synthetic nucleic acid library in the same reaction step as when the members of a synthetic nucleic acid library are themselves assembled from individual nucleic acid components.

Nucleic acids may be designed to avoid sequences that may facilitate DNA damage. For example, sequences containing sites for site-specific nucleases may be avoided. As another example, UVB (ultraviolet-B) light may cause adjacent thymines to form pyrimidine dimers which may then inhibit sequencing and PCR. Therefore, if a synthetic nucleic acid library is intended to be stored in an environment exposed to UVB, then it may be beneficial to design its nucleic acid sequences to avoid adjacent thymines (i.e., TT) or adjacent cytosines (i.e., CC).

All information contained within the Chemical Methods section is intended to support and enable the aforementioned technologies, methods, protocols, systems, and processes.

EXAMPLES

Example 1: Encoding, Writing and Reading a Single Poem in DNA Molecules

Data to be encoded is a textfile containing a poem. The data is encoded manually with pipettes to mix together DNA components from two layers of 96 components to construct identifiers using the product scheme implemented with overlap extension PCR. The first layer, X, comprises 96 total DNA components. The second layer, Y, also comprises 96 total components. Prior to writing the DNA, the data is mapped to binary and then recoded to a uniform weight format where every contiguous (adjacent disjoint) string of 61 bits of the original data is translated to a 96-bit string with exactly 17 bit-values of 1. This uniform weight format may have natural error checking qualities. The data is then hashed into a 96 by 96 table to form a reference map.
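The example does not state how 61-bit blocks are translated into 96-bit strings of weight 17. One possible approach, shown here purely as an assumption, is the combinatorial number system, which ranks and unranks fixed-weight bit strings. It works because the number of 96-bit strings with exactly 17 ones, C(96, 17), is at least 2^61 (and less than 2^62), so 61 is the largest whole number of bits that such a string can carry. The function names are hypothetical.

```python
from math import comb

N_BITS, WEIGHT, BLOCK = 96, 17, 61  # figures quoted in Example 1


def unrank(r: int, n: int = N_BITS, k: int = WEIGHT) -> list[int]:
    """Return the r-th n-bit string with exactly k ones (assumed ordering)."""
    if not 0 <= r < comb(n, k):
        raise ValueError("rank out of range")
    bits = []
    for pos in range(n):
        # Number of valid strings that place a 0 at this position.
        zeros_here = comb(n - pos - 1, k)
        if r < zeros_here:
            bits.append(0)
        else:
            bits.append(1)
            r -= zeros_here
            k -= 1
    return bits


def rank(bits: list[int]) -> int:
    """Inverse of unrank: recover the integer from a fixed-weight string."""
    n, k, r = len(bits), sum(bits), 0
    for pos, b in enumerate(bits):
        if b:
            r += comb(n - pos - 1, k)
            k -= 1
    return r


def encode_block(value: int) -> list[int]:
    """Map a 61-bit integer to a 96-bit, weight-17 string (sketch only)."""
    if not 0 <= value < 2 ** BLOCK:
        raise ValueError("value does not fit in 61 bits")
    return unrank(value)


if __name__ == "__main__":
    word = encode_block(123456789)
    print(len(word), sum(word), rank(word))
```

Because every codeword has the same weight, a decoder may check that exactly 17 identifiers are called per 96-bit string, which is one way the format can provide error checking.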

The middle panel of FIG. 18A shows the two-dimensional reference map of a 96 by 96 table encoding the poem into a plurality of identifiers. Dark points correspond to a ‘1’ bit-value and white points correspond to a ‘0’ bit-value. The data is encoded into identifiers using two layers of 96 components. Each X value and Y value of the table is assigned a component and the X and Y components are assembled into an identifier using overlap extension PCR for each (X,Y) coordinate with a ‘1’ value. The data was read back (e.g., decoded) by sequencing the identifier library to determine the presence or absence of each possible (X,Y) assembly.

The right panel of FIG. 18A shows a two-dimensional heat map of the abundances of sequences present in the identifier library as determined by sequencing. Each pixel represents a molecule comprising the corresponding X and Y components, and the greyscale intensity at that pixel represents the relative abundance of that molecule compared to other molecules. Identifiers are taken as the top 17 most abundant (X, Y) assemblies in each row (as the uniform weight encoding guarantees that each contiguous string of 96 bits has exactly 17 ‘1’ values, and hence 17 corresponding identifiers).

Example 2: Encoding a 62824 Bit Textfile

Data to be encoded is a textfile of three poems totaling 62824 bits. The data is encoded using a Labcyte Echo® Liquid Handler to mix together DNA components from two layers of 384 components to construct identifiers using the product scheme implemented with overlap extension PCR. The first layer, X, comprises 384 total DNA components. The second layer, Y, also comprises 384 total components. Prior to writing the DNA, the data is mapped to binary and then recoded to decrease the weight (number of bit-values of ‘1’) and include checksums. The checksums are established so that there is an identifier that corresponds to a checksum for every contiguous string of 192 bits of data. The re-coded data has a weight of approximately 10,100, which corresponds to the number of identifiers to be constructed. The data may then be hashed into a 384 by 384 table to form a reference map.

The middle panel of FIG. 18B shows a two-dimensional reference map of a 384 by 384 table encoding the textfile into a plurality of identifiers. Each coordinate (X,Y) corresponds to the bit of data at position X+(Y−1)*192. Black points correspond to a bit value of ‘1’ and white points correspond to a bit value of ‘0’. The black points on the right side of the figure are the checksums and the pattern of black points on the top of the figure is the codebook (e.g., dictionary for de-coding the data). Each X value and Y value of the table may be assigned a component and the X and Y components are assembled into an identifier using overlap extension PCR for each (X, Y) coordinate with a ‘1’ value. The data was read back (e.g., decoded) by sequencing the identifier library to determine the presence or absence of each possible (X, Y) assembly.

The right panel of FIG. 18B shows a two-dimensional heat map of the abundances of sequences present in the identifier library as determined by sequencing. Each pixel represents a molecule comprising the corresponding X and Y components, and the greyscale intensity at that pixel represents the relative abundance of that molecule compared to other molecules. Identifiers are taken as the top S most abundant (X, Y) assemblies in each row, where S for each row may be the checksum value.

Example 3: A Comparison of 5′ Versus 3′ Overhangs and 4-Base Versus 6-Base Overhangs on a 15-Piece, Sticky-End Ligation

Table 1 presents the measured ligation efficiency of 4 different sets of 15 DNA components labeled the following: 6/24/6 3′, 6/24/6 5′, 4/24/4 3′, and 4/24/4 5′. The first 3 numbers in the label, X/Y/Z, indicate the form of each DNA component in the set with an X-base overhang on one end, a Y-base duplex (or barcode) region in the middle, and a Z-base overhang on the other end. The final number in each label (preceding the apostrophe) indicates whether the overhangs in the set are 5′ or 3′. Ligation was performed at 37° C. with 0.067 μM each DNA component, 5 CEU/μL of T4 Ligase (CEU=Cohesive End Unit), 7.5% w/v PEG6000, 20% v/v glycerol, and standard T4 ligase buffer parts. Ligation time was 2.5 minutes. Efficiency was measured with qPCR relative to a full length control (FLC) representing the fully ligated product for each possible set.

TABLE 1
Measured ligation efficiency

15-component set    Average ligation efficiency    Standard deviation
6/24/6 5′           0.2471%                        0.0750%
6/24/6 5′           0.7237%                        0.1059%
6/24/6 5′           0.0275%                        0.0047%
6/24/6 3′           0.2221%                        0.0470%
6/24/6 3′           0.0490%                        0.0068%
6/24/6 3′           0.0398%                        0.0077%
4/24/4 5′           0.0008%                        0.0001%
4/24/4 5′           0.0008%                        0.0002%
4/24/4 5′           0.0003%                        0.0000%
4/24/4 3′           0.0014%                        0.0003%
4/24/4 3′           0.0047%                        0.0005%
4/24/4 3′           0.0008%                        0.0002%

FIG. 22 presents a gel electrophoresis image of the qPCR products from one of each of the 4 different experimental ligation reactions alongside their respective FLCs, which have a length of around 450 bases. Together with Table 1, results indicate that 6-base overhangs led to higher ligation efficiency and specificity of full length product than 4-base overhangs. No obvious pattern in efficiency is observed regarding the use of 5′ overhangs versus 3′ overhangs.

FIGS. 23A and 23B present data for ligation efficiency of 6/24/6 3′ (FIG. 23B) and 6/24/6 5′ (FIG. 23A) DNA component sets ligated for 2, 2.5, 3, and 1440 minutes. FIGS. 23A and 23B show ligation efficiency as measured by qPCR relative to the FLC for each set. FIG. 23C shows a gel electrophoresis image of the qPCR products alongside their FLCs, which have a length of around 450 bases. Results also indicate that the 3′ overhang set may have higher specificity than the 5′ overhang set.

Example 4: Testing the Effect of Overhang Length, Overhang Melting Temperature, and Overhang GC Content on Sticky-End Ligation Efficiency

Table 2 presents the characteristics of 9 different sticky-ended (with 3′ overhang) DNA component pairs designed to have different length overhangs (short=6-base, medium=8-base, and long=10-base), different GC contents (low, medium, and high), and different melting temperatures (Tm). The overhangs themselves are given in the cells of the table along with their predicted melting temperatures in degrees Celsius. Ligation was performed on each DNA component pair at 37° C. with 0.067 μM each DNA component, 5 CEU/μL of T4 Ligase, 7.5% w/v PEG6000, 20% v/v glycerol, and standard T4 ligase buffer parts. Ligation was performed for 2.5 minutes and for 60 minutes. Efficiency was measured using qPCR relative to a full length control representing the fully ligated product for each pair.

TABLE 2
Characteristics of different sticky-ended (with 3′ overhang) DNA component pairs

          ShortLength (6)              MedLength (8)                  HighLength (10)
LowGC     Pair 1: Tm = −4.3, CAAGAA    Pair 4: Tm = 8.4, TAGATAAG     Pair 7: Tm = 21.4, TAGTATAAGA
MedGC     Pair 2: Tm = 9.0, CCTCGA     Pair 5: Tm = 20.8, CCAATACC    Pair 8: Tm = 37.4, GAGAGAGGTC
HighGC    Pair 3: Tm = 20.7, GCCCCC    Pair 6: Tm = 37.4, CGAACGCC    Pair 9: Tm = 51.2, CGCCACCCAC

FIGS. 24A and 24B present the ligation efficiency for these DNA component pairs grouped by overhang lengths. FIG. 24A shows the 2.5 minute ligation efficiencies and FIG. 24B shows the ratio of efficiencies between the 2.5 and 60 minute timepoints. Results indicate that ligation rate may be higher when shorter overhangs are used.

FIGS. 25A and 25B present the ligation efficiency for these DNA component pairs grouped by GC content. FIG. 25A shows the 2.5 minute ligation efficiencies and FIG. 25B shows the ratio of efficiencies between the 2.5 and 60 minute timepoints. Results indicate that there may not be large differences in ligation rate for overhangs of different GC contents (or melting temperatures), but that there may be a slightly higher ligation rate when overhangs with higher GC content (or melting temperature) are used. The melting temperatures correlate with GC content.

Example 5: Testing the Effect of Temperature on Ligation Efficiency

FIG. 26 presents data from the ligation of 4 sticky-ended (with 6-base, 3′ overhangs) DNA components, ligated together with T4 ligase at various temperatures. Ligation was performed with 0.25 μM each DNA component, 5 CEU/μL or 20 CEU/μL of T4 Ligase, 7.5% w/v PEG6000, 20% v/v glycerol, and standard T4 ligase buffer parts. Ligation time was 2.5 minutes. Efficiency was measured using qPCR relative to a full length control representing the fully ligated product. Results indicate that higher temperatures and higher ligase concentrations may increase ligation efficiency with T4 ligase.

FIG. 27 presents data from the ligation of 4 sticky-ended (with 6-base, 3′ overhangs) DNA components, ligated together with T4 ligase at various temperatures. Ligation was performed with 0.125 μM each DNA component, 5 CEU/μL T4 Ligase (in 20 μL, so 100 CEU total), 7.5% w/v PEG6000, 20% v/v glycerol, and standard T4 ligase buffer parts. Ligation time was 2.5 minutes. Efficiency was measured using qPCR relative to a full length control representing the fully ligated product. Results indicate that higher temperatures and higher ligase concentrations may increase ligation efficiency with T4 ligase. Results indicate a trend similar to that observed in FIG. 26.

Example 6: Testing the Effect of Ligase Type on Ligation Efficiency

FIGS. 28A and 28B present data for ligation efficiencies of T7 (FIG. 28A) and T3 (FIG. 28B) DNA ligase, as compared to T4 DNA ligase. Ligation was performed on 4 sticky-ended (with 6-base, 3′ overhangs) DNA components at 25° C. with 0.125 μM each DNA component. Ligation time was 2.5 minutes. Efficiency was measured using qPCR relative to a full length control representing the fully ligated product. Ligase concentrations varied between and 100 CEU/μL. Within each plot, efficiencies are compared to the same ligation performed with T4 DNA ligase at 5 CEU/μL. Results indicate that T3 ligase at a concentration of around 100 CEU/μL may be the optimal ligase for room temperature ligations.

FIG. 29 presents data for ligation efficiencies of E. coli DNA Ligase at various concentrations. Ligation was performed on 4 sticky-ended (with 6-base, 3′ overhangs) DNA components at 25° C. with 0.125 μM each DNA component. Ligation time was 2.5 minutes. Efficiency was measured using qPCR relative to a full length control representing the fully ligated product. Ligase concentrations varied between 1 and 100 CEU/μL.

Table 3 presents average ligation efficiency measurements for 4 different types of ligase. Ligation was performed on 15 sticky-ended (with 6-base, 3′ overhangs) DNA components at 25° C. with 0.268 μM each DNA component. Ligation time was 2.5 minutes. Efficiency was measured using qPCR relative to a full length control representing the fully ligated product. T4 was at 20 CEU/μL, and T3 and T7 were each at 150 CEU/μL.

TABLE 3
Average ligation efficiency measurements

                     Ligation Efficiency    StDev
T4                   0.039%                 0.004%
T4 + 7.5% PEG600     0.298%                 0.012%
T7                   0.419%                 0.043%
T3                   0.804%                 0.237%

FIGS. 30A and 30B present data from the ligation of 4 sticky-ended (with 6-base, 3′ overhangs) DNA components, ligated together with T7 DNA ligase (FIG. 30A) or T3 DNA ligase (FIG. 30B) at various temperatures. Ligation was performed with 0.125 μM each DNA component and 150 CEU/μL T7 or T3 DNA Ligase. Ligation time was 2.5 minutes. Efficiency was measured using qPCR relative to a full length control representing the fully ligated product. Results indicate that T3 and T7 may lose efficiency between 20° C. and 40° C., with T3 dropping faster, but having a higher efficiency at lower temperatures (e.g., 15 to 20° C.). This indicates that at higher temperature incubations (e.g., 37° C.), T4 DNA ligase (see, e.g., FIG. 26 and FIG. 27) may perform better than T3 and T7 DNA ligase.

Example 7: Testing the Effect of Polyethyleneglycol (PEG) on Ligation Efficiency

FIGS. 31A-C present data from ligation of 4 sticky-ended (with 10-base, 3′ overhangs) DNA components ligated together with various amounts (in terms of percent weight-per-volume) of PEG8000 (FIG. 31A), PEG6000 (FIG. 31B), and PEG400 (FIG. 31C). Ligation was performed with 0.125 μM each DNA component and 5 CEU/μL T4 ligase at 25° C. Ligation time was 2.5 minutes. Efficiency was measured using qPCR relative to a full length control representing the fully ligated product. Results indicate that adding PEG up to a particular amount to a ligation may improve efficiency, but then inhibit efficiency beyond a certain amount. The amount of PEG that may be added to a ligation reaction to improve efficiency depends on the molecular weight of the PEG.

FIG. 32 presents data from ligation of 4 sticky-ended (with 10-base, 3′ overhangs) DNA components ligated together in the presence of either PEG400 or PEG6000 at low weight-per-volume concentrations. Ligation was performed with 0.125 μM each DNA component, 5 CEU/μL T4 DNA ligase, 20% v/v glycerol, and standard T4 ligase buffer parts at 37° C. Ligation time was 2.5 minutes. Efficiency was measured using qPCR relative to a full length control representing the fully ligated product. Results indicate that under these conditions, adding PEG6000 may improve ligation efficiency more than adding an equivalent amount (by weight) of PEG400.

Example 8: A Comparison of Ligation Deactivation Methods

FIG. 33 presents data on using buffer QG or EDTA to inactivate ligase. Ligation was performed on 4 sticky-ended DNA components. The buffer QG refers to buffer QG manufactured by Qiagen or a buffer with similar components (e.g., 5.5 M guanidine thiocyanate (GuSCN), 20 mM Tris HCl pH 6.6). In the control group, T4 ligase was used under standard buffer conditions at room temperature in the given volume indicated on the horizontal axis. In the experimental group, the T4 ligase reaction mix was treated with the indicated additive prior to being added to the DNA components to make a reaction of the given volume. Ligation time was 2.5 minutes. The vertical axis shows Ct results obtained from qPCR on the full length product of each ligation. Note that Ct represents a log base-2 scale for concentration. Results indicate that using EDTA or buffer QG may deactivate ligase. The results of the ligation groups with EDTA and buffer QG deactivated ligase look similar to the results of the no ligase group.
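Because Ct is a log base-2 scale for concentration, a difference in Ct between two samples can be converted to an approximate fold difference in starting template. The sketch below assumes ideal amplification (template doubling each cycle); the function name and the `amplification_factor` parameter are illustrative and do not appear in the example.

```python
def fold_change_from_ct(ct_sample: float, ct_reference: float,
                        amplification_factor: float = 2.0) -> float:
    """Estimate sample/reference template ratio from qPCR Ct values.

    Assumes each cycle multiplies the template by amplification_factor
    (2.0 for ideal doubling). A higher Ct means less starting template.
    """
    return amplification_factor ** (ct_reference - ct_sample)


if __name__ == "__main__":
    # A sample that crosses threshold 3 cycles later than the reference
    # started with roughly one eighth of the template.
    print(fold_change_from_ct(23.0, 20.0))
```

Under this reading, a ligation group whose Ct matches that of the no ligase group contains a comparable amount of full length product.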

Example 9: A Study of DNA Replication

FIG. 34 presents data on the linearity of replication using Q5, Phusion, and Taq DNA polymerase. The horizontal axis represents theoretical target DNA concentration (ng/μL), and the vertical axis represents measured target DNA concentration (ng/μL) using qPCR relative to a standard. Measurements were taken at different cycles of PCR reaction. The dots on the full diagonal represent full linearity (theoretical). Other dots represent experimental data points from different polymerases. Results indicate that standard PCR reactions (regardless of polymerase) may be linear up to or beyond 10 ng/μL of target. In this example, the target DNA used was ~450 bases.

Example 10: A Study of Different Methods for Drying DNA

FIG. 35 presents data for DNA samples stored at room temperature for 4 days. Different amounts of DNA samples containing DNA of about 450 bases in length were stored (50 ng, 500 ng, and 5000 ng). The DNA samples were stored in different conditions: wet or dry, and with or without preserving additive (e.g., BM represents biostabilizing material). Results were compared to the same DNA samples containing DNA of about 450 bases in length stored in frozen water during those 4 days. Results indicate that minimal DNA degradation may take place at room temperature and that the use of preserving additive, like BM (biostabilizing material), may contribute to decreased degradation. The drying process may lead to DNA degradation without the presence of DNA preserving additive.

FIG. 36 presents data for DNA repeatedly being dried and re-hydrated at room temperature. Results are shown for DNA with and without preserving additive (e.g., BM represents biostabilizing material). Results indicate that drying and rehydrating DNA samples 3-4 times, with or without preserving additive, can be achieved without losing substantial amounts of DNA.

Example 11: Designing and Testing 6 Base Overhangs for Ligation

Table 4 presents a set of 32 computationally designed 3′ overhangs. The overhangs (and their reverse complements) were designed to have a length of 6 bases, no homopolymers of more than 3 bases, no Hamming distances less than 3 bases between each other, no equivalent substrings of more than 3 bases between each other, and no equivalent substring of more than 2 bases from each other for substrings on either edge of the overhang.

TABLE 4
A set of 32 computationally designed 3′ overhangs

ID  sequence    ID  sequence    ID  sequence    ID  sequence
 1  GAGAAC       9  CTCGCC      17  ATTTGA      25  CTTGGG
 2  TCTATC      10  GCCTAA      18  AAACTA      26  TGGAGT
 3  CCATCT      11  AGGGTC      19  TGCCGG      27  ATCCTT
 4  TTTACT      12  CAGCGT      20  TGACCC      28  CGGCAA
 5  TGTGTA      13  CTACAT      21  CTGATA      29  TCCGTT
 6  ACCCAC      14  GTCATG      22  AGCAGC      30  CACTCG
 7  CCTTTG      15  CGTCGC      23  GGAATT      31  TAAGAA
 8  TCGTGC      16  GAATAT      24  GGTTAC      32  CGCTGT

Table 5 presents another set of 32 computationally designed 3′ overhangs. This set of 6-base overhangs (and their reverse complements) was designed to be overall less stringently constrained than those of Table 4, but to contain subsets of 16 overhangs within that met the equivalent constraints to those in Table 4. The two bolded sequences were designed to be reverse complements of each other, as a control for a combinatorial experiment.

TABLE 5
A set of 32 computationally designed 3′ overhangs

ID  sequence    ID  sequence    ID  sequence    ID  sequence
 1  CGTTAC       9  TCGAAA      17  TAGGCC      25  CGCCAT
 2  GTCTCG      10  TGTTCC      18  TTAGCT      26  ATCGGC
 3  GTTGAC      11  GCATAG      19  TCATTC      27  TGCACT
 4  ACTGAG      12  CCAAAG      20  AGGCGG      28  GCGACC
 5  TACCAC      13  CGAGAC      21  TTGCTT      29  GGGAAT
 6  CATCCA      14  CAATCG      22  GAGTTT      30  AATAGC
 7  CCTTCA      15  CAAGAC      23  TCCTGT      31  AACTCT
 8  TCTACG      16  GTTAGG      24  TAAGTG      32  GATCAG

Sticky-end DNA sequences for each overhang and their reverse complements in Table 4 and Table 5 were constructed. Each sequence for each overhang (and reverse complement) in each table had the same proximal duplex region but was uniquely barcoded on its distal end with a distinct 3-base 5′ overhang. See FIG. 37 for the scheme of the constructed sticky end sequences. In total, with reverse complements, 64 sequences were constructed for each table. Those sequences were pooled in equimolar concentration and ligated with T4 ligase at 37° C. in standard ligase buffer. Ligation was performed for 2.5 minutes prior to being quenched with EDTA. Ligated sequences were purified through gel extraction and then 5′ ends were filled and dA-tailed using Klenow Polymerase. Sequencing adapters were subsequently ligated to the ends of the products, and amplified and purified to prepare for sequencing on the Illumina iSeq. The relative copy number of each possible ligated product was inferred by counting the number of sequence reads for each possible combination of barcodes. There were 64×(64+1)/2=2080 possible products in total for each set of overhangs (Table 4 and Table 5), 64 of which in each correspond to overhangs ligated to their correct reverse complement partners.
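The count of 2080 possible products follows from choosing unordered pairs, with repetition allowed, from 64 sticky-end sequences. The short sketch below reproduces that count and includes a reverse-complement helper; the function names are illustrative assumptions and are not part of the described workflow.

```python
from itertools import combinations_with_replacement

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_possible_products(n_ends: int) -> int:
    """Unordered pairs of ends, including an end paired with itself."""
    return sum(1 for _ in combinations_with_replacement(range(n_ends), 2))


if __name__ == "__main__":
    print(reverse_complement("GAGAAC"))   # overhang 1 of Table 4
    print(count_possible_products(64))    # equals 64 * (64 + 1) / 2
```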

FIG. 38 presents the data from the ligation of the set of overhang sequences in Table 4 (FIG. 38A) and Table 5 (FIG. 38B). Each pixel in each heatmap corresponds to the ligation product formed by the overhangs that represent the row and column of that pixel. The greyscale (or “heat”) of the pixel represents the relative amount of that ligation product (in log base-2 scale). Each row and column corresponds to an overhang 1-32 from either Table 4 (FIG. 38A) or Table 5 (FIG. 38B) and then the reverse complements of those overhangs. Results suggest that each overhang ligates most strongly with its reverse complement, but that multiple non-specific products may also be formed in a ligation.

These data were used to calculate penalty scores for subsets of overhangs from each set of 32 overhangs. For a subset of overhangs, penalty scores were calculated by adding the relative amount of off-target product formed for each possible overhang in the subset (compared to the amount of correct product) in the data set.
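The exact penalty formula is not given, so the following is only one plausible reading of the description, offered as a sketch. It assumes read counts are stored per unordered pair of ends, that an overhang's reverse complement is labeled with a trailing asterisk (a hypothetical convention), and that an off-target product between two members of the subset contributes to the penalty of both.

```python
from itertools import combinations_with_replacement


def penalty_score(counts: dict, subset: list[int]) -> float:
    """Sum, over overhangs in the subset, of off-target / correct reads.

    counts maps frozenset({end_a, end_b}) to a read count, where ends are
    labeled "i" for overhang i and "i*" for its reverse complement
    (labels are an assumption of this sketch). Only products formed
    between ends belonging to the subset are considered.
    """
    ends = [label for i in subset for label in (f"{i}", f"{i}*")]
    total = 0.0
    for i in subset:
        own = frozenset((f"{i}", f"{i}*"))
        correct = counts.get(own, 0)
        if correct == 0:
            return float("inf")  # no correct product observed
        off_target = 0
        for a, b in combinations_with_replacement(ends, 2):
            pair = frozenset((a, b))
            if pair != own and (a in own or b in own):
                off_target += counts.get(pair, 0)
        total += off_target / correct
    return total


if __name__ == "__main__":
    toy_counts = {  # made-up read counts for illustration only
        frozenset(("1", "1*")): 100,
        frozenset(("2", "2*")): 200,
        frozenset(("1", "2")): 10,
        frozenset(("1",)): 5,  # end "1" ligated to another copy of itself
    }
    print(penalty_score(toy_counts, [1, 2]))
    print(penalty_score(toy_counts, [1]))
```

Scoring many candidate subsets with a function of this kind and keeping the lowest scores is one way to shortlist overhang sets.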

FIG. 39 presents penalty scores from 2M subsets of 15 overhangs from each of the sets of overhangs in Table 4 and Table 5. Penalty scores may be used to predict high-efficiency, high-specificity sets of 15 overhangs to be used in a 16-component ligation. Top candidates are those with the lowest penalty scores. Similar analysis may be done with subsets of X overhangs to find top overhang candidates for ligating together X+1 components. Based on this analysis, Table 6 presents putative high-efficiency, high-specificity subsets of 15 overhangs (taken from the set in Table 4) for ligating together 16 DNA components. Likewise, Table 7 presents putative subsets of 15 overhangs (taken from the set in Table 5) for ligating together 16 DNA components.

TABLE 6
Putative high-efficiency, high-specificity subsets of 15 overhangs

Penalty score    Overhang IDs from Table 4
0.51             [3, 5, 7, 8, 9, 11, 13, 14, 17, 21, 23, 24, 25, 28, 30]
0.52             [3, 4, 7, 11, 12, 13, 17, 21, 23, 24, 25, 26, 28, 30, 32]
0.54             [3, 4, 7, 11, 12, 13, 14, 15, 23, 24, 25, 26, 28, 30, 32]
0.58             [6, 7, 8, 9, 11, 12, 14, 17, 18, 20, 21, 23, 25, 28, 30]

TABLE 7
Putative subsets of 15 overhangs

Penalty score    Overhang IDs from Table 5
0.42             [1, 4, 6, 15, 17, 19, 20, 21, 22, 24, 25, 26, 28, 30, 32]
0.43             [4, 6, 8, 15, 17, 19, 20, 21, 22, 23, 24, 25, 27, 30, 32]
0.44             [4, 5, 6, 15, 16, 17, 20, 21, 22, 24, 25, 27, 28, 30, 32]
0.45             [4, 5, 6, 7, 8, 15, 17, 19, 20, 21, 24, 25, 27, 30, 32]
2.1              [1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17]

FIG. 40 presents data for the ligation efficiency of 16 DNA components using the overhangs from the final (bold) row of Table 7 and a particular formulation of ligation mix that may be optimized for dispensing out of a printhead. The mix contains humectant in the form of glycerol, dye in the form of Orange G, and biocide in the form of Nipacide. Ligation was performed at two ligase concentrations: 0.1 Weiss units/μL and 0.2 Weiss units/μL. Moreover, ligation was performed with 0.0625 μM each DNA component, 22.5% v/v glycerol, 3.1% w/v PEG6000, 1.25% w/v orange G dye, 0.1% w/v Nipacide, and standard T4 ligase buffer parts at 37° C. Ligation time was 2.5 minutes. Efficiency was measured using qPCR relative to a full length control representing the fully ligated product.

Example 12: Encoding to, Replicating, and Accessing from 60 kb of Digital Information

A digitized audio clip (“message”) of length 68,800 bits (73,440 bits after error protection) was encoded using a component library of 372 DNA components in an eight-layer product scheme (see FIG. 16B for product scheme overview). There were 7 layers of 3 components (the “base layers”) and one layer (the “multiplex layer”) of 351 components, and therefore 767637 possible identifiers, but the encoded message only used 119353 identifiers from the combinatorial space. The writing was performed on the Labcyte Echo 555 Access System. The process was repeated twice. DNA components were designed computationally and constructed by duplexing manufactured oligos.

The writing process occurred in 4 phases: (1) computational encoding, (2) DNA component collocation, (3) ligation, and (4) consolidation. During (1) computational encoding, the error corrected message was encoded into contiguous codewords of length 13 and weight 3. Hence codewords were represented by 13 lexicographically ordered identifiers, 3 of which were intended to be present (“true identifiers”), and the other 10 intended not to be present (“false identifiers”). There were 9181 codewords in total. In (2) DNA collocation, the 372 DNA components were mixed together in 341 reaction wells (of a 384-well plate) using the Labcyte Echo 555. Each reaction was intended to create 27 contiguous codewords (81 true identifiers total), except for one reaction, which was intended to create only one codeword (3 true identifiers total). Reactions were set up to contain one DNA component from each of the base layers and multiple components from the multiplex layer (3 for each codeword). Additionally, sequencing adapters to ligate onto each end of the fully formed identifiers were added to reaction wells. In (3) ligation, 4 μL of T4 ligase reaction mix (containing 5 CEU/μL of T4 ligase, and 7.5% PEG6000) was added to each reaction well and incubated at 37° C. for 1 hour. Concentrations were set up such that each reaction contained approximately 4 nM of aggregate DNA components from each layer. Subsequently, in (4) consolidation, approximately 50 nL of every reaction was consolidated into one container with EDTA solution to deactivate the ligase activity. The consolidated pool of identifiers (the identifier library) was amplified using PCR and gel purified to extract full length identifiers for sequencing.
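The figures quoted in this example are mutually consistent, as the following arithmetic sketch shows. The variable names are illustrative. The observation that 119353 equals 9181 codewords times 13 identifiers per codeword is an inference from the stated numbers, not a statement made in the example.

```python
from math import comb, prod

BASE_LAYERS = 7              # "base layers", 3 components each
COMPONENTS_PER_BASE_LAYER = 3
MULTIPLEX_COMPONENTS = 351   # single "multiplex layer"
CODEWORD_LENGTH = 13
CODEWORD_WEIGHT = 3
CODEWORDS = 9181
WELLS = 341
CODEWORDS_PER_FULL_WELL = 27

total_components = BASE_LAYERS * COMPONENTS_PER_BASE_LAYER + MULTIPLEX_COMPONENTS
possible_identifiers = prod([COMPONENTS_PER_BASE_LAYER] * BASE_LAYERS) * MULTIPLEX_COMPONENTS
identifiers_spanned = CODEWORDS * CODEWORD_LENGTH
true_identifiers = CODEWORDS * CODEWORD_WEIGHT
codewords_from_wells = (WELLS - 1) * CODEWORDS_PER_FULL_WELL + 1
identifiers_per_full_well = CODEWORDS_PER_FULL_WELL * CODEWORD_LENGTH
codeword_patterns = comb(CODEWORD_LENGTH, CODEWORD_WEIGHT)

if __name__ == "__main__":
    print("components:", total_components)
    print("possible identifiers:", possible_identifiers)
    print("identifiers spanned by codewords:", identifiers_spanned)
    print("true identifiers:", true_identifiers)
    print("codewords from well layout:", codewords_from_wells)
    print("identifiers per full well:", identifiers_per_full_well)
    print("distinct weight-3 patterns per codeword:", codeword_patterns)
```

Each full well spans 27 × 13 = 351 identifier positions, which matches the number of components in the multiplex layer.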

FIGS. 41A-B present data recovered from sequencing the identifier library that encodes the message. FIG. 41A shows a 341×351 reference map of the encoded message (after computational encoding). Dark points correspond to a ‘1’ bit-value and white points correspond to a ‘0’ bit-value. The data is written in DNA by constructing identifiers corresponding to the positions of the ‘1’ bit-values (which is possible because the identifiers have a lexicographic order). FIG. 41B shows a heat map (341×351) of the abundances of sequences present in the identifier library as determined by sequencing. Each pixel represents an identifier and the greyscale intensity at that pixel represents the relative abundance of that identifier compared to other identifiers in the row. Identifiers of each row are constructed in the same reaction. Maximum greyscale (dark) intensity is set at the average copy number for identifiers in each row. Identifiers may be interpreted as true identifiers (identifiers that represent bit values of ‘1’) if they are within the top 3 most abundant identifiers in a contiguous string of 13 identifiers (along the rows of the map). All others are interpreted to be false identifiers (identifiers that represent bit values of ‘0’). Applying this decoding processing step to the data results in zero identifier errors (events where, within a codeword, a false identifier has more reads than a true identifier) and zero identifier erasures (events where the top 3 most abundant identifiers cannot be distinguished). Therefore the decoded message exactly matches the encoded message (FIG. 41A). FIG. 42 presents data from a duplicate run of the entire encoding, writing, sequencing, and decoding process. Again, the message was successfully written and read with zero errors or erasures.
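The decoding step described above (calling the top 3 most abundant identifiers in each contiguous string of 13 as ‘1’) may be sketched as follows. The function names are hypothetical, and the erasure rule used here, a tie in read count between the third and fourth most abundant identifiers, is an assumed interpretation of "cannot be distinguished".

```python
from typing import Optional


def decode_codeword(counts: list[int], weight: int = 3) -> Optional[list[int]]:
    """Call the `weight` most abundant identifiers as 1, the rest as 0.

    Returns None (an erasure) when the boundary between called and
    uncalled identifiers is tied, so the top set is ambiguous.
    """
    order = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    if counts[order[weight - 1]] == counts[order[weight]]:
        return None
    top = set(order[:weight])
    return [1 if i in top else 0 for i in range(len(counts))]


def decode_row(row_counts: list[int], length: int = 13, weight: int = 3):
    """Split one row of read counts into codewords and decode each."""
    return [
        decode_codeword(row_counts[start:start + length], weight)
        for start in range(0, len(row_counts), length)
    ]


if __name__ == "__main__":
    clean = [0, 50, 0, 1, 40, 0, 0, 0, 60, 0, 0, 2, 0]   # made-up reads
    dropout = [10, 12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # one true identifier unread
    print(decode_codeword(clean))
    print(decode_codeword(dropout))
```

A full row of 351 identifiers decodes to 27 codewords under this scheme.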

FIGS. 43A-C present data from creating multiple copies of the original identifier library containing the message (from FIGS. 41A-B). The library was diluted 1000× and then amplified with 10 cycles of PCR with Phusion polymerase and primers that bound to the outer edges of the adapter sequences (common to all sequences in the library). The 10-cycle PCR amplified the library ~1024× back to its original concentration. FIG. 43A shows a heat map (341×351) of the abundances of sequences present in the replicated identifier library as determined by sequencing. Each pixel represents an identifier and the greyscale intensity at that pixel represents the relative abundance of that identifier compared to other identifiers in the row. Maximum greyscale (dark) intensity is set at the average copy number for identifiers in each row. Identifiers may be interpreted to represent bit values of ‘1’ if they are within the top 3 most abundant identifiers in a contiguous string of 13 identifiers (along the rows of the map). All others are interpreted to represent bit values of ‘0’. Applying this decoding processing step to the data results in zero identifier errors. There was one identifier erasure, which may be explained by small sequencing sample size (see Table 8). It was a codeword in which all false identifiers had zero reads, but one of the true identifiers also had zero reads. FIG. 43B shows the correlation between identifier copy numbers in the original identifier library versus the replicated identifier library, and FIG. 43C shows the distribution of identifier copy numbers in the original identifier library versus the replicated identifier library. Results indicate that little or no bias may occur during identifier library replication.
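The ~1024× figure follows if each PCR cycle is assumed to double every template, which is an idealization (real per-cycle efficiency is generally somewhat lower):

```latex
\[
2^{10} = 1024,
\qquad
\frac{2^{10}}{1000} \approx 1.02 ,
\]
```

so 10 ideal cycles approximately offset the 1000× dilution and return the library to roughly its starting concentration.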

FIGS. 44A-C present data from accessing a portion of the identifier library containing the original message (from FIGS. 41A-B). The access method was an ‘AND’ operation as described in FIG. 17B. The identifier library was diluted ~32000× and then amplified using PCR with primers that bound to a specific DNA component of each edge layer, thus accessing approximately 1/9th of the library (since each layer had 3 possible components). The PCR was performed with Phusion polymerase for 15 cycles. Sequencing adapters were ligated onto the ends of the resulting sub-library, and it was sequenced on the Illumina iSeq. FIG. 44A shows a heat map (341×351) of the abundances of sequences present in the accessed identifier library as determined by sequencing. Each pixel represents an identifier and the greyscale intensity at that pixel represents the relative abundance of that identifier compared to other identifiers in the row. Maximum greyscale (dark) intensity is set at the average copy number for identifiers in each row. Identifiers may be interpreted to represent bit values of ‘1’ if they are within the top 3 most abundant identifiers in a contiguous string of 13 identifiers (along the rows of the map). All others are interpreted to represent bit values of ‘0’. Applying this decoding processing step to the data results in zero identifier errors and zero identifier erasures, and therefore a dataset that exactly matches the encoded message (FIG. 41A). FIG. 44B shows the correlation between identifier copy numbers in the original library versus the accessed identifier library, and FIG. 44C shows the distribution of identifier copy numbers in the original identifier library versus the accessed identifier library. Results indicate that little or no bias may occur during identifier library access.

FIGS. 45A-C present data from further accessing a sub-portion of the accessed identifier library (from FIGS. 44A-C). The access method from the original identifier library was two nested ‘AND’ operations (where each ‘AND’ was as described in FIG. 17B). The original identifier library was diluted ~32000× and then amplified using PCR with primers that bound to a specific DNA component of each edge layer, thus accessing approximately 1/9th of the library (since each layer had 3 possible components). The resulting accessed identifier library was diluted again ~32000× and then amplified using PCR with primers that bound to specific DNA components on layers one removed from each edge, thus accessing approximately 1/9th of the accessed library (since each layer had 3 possible components), or approximately 1/81 of the original library overall (1/9th of 1/9th). We refer to the resulting sub-library as the “2× accessed” identifier library. The PCR was performed with Phusion polymerase for 15 cycles. Sequencing adapters were ligated onto the ends of the resulting sub-library, and it was sequenced on the Illumina iSeq. FIG. 45A shows a heat map (341×351) of the abundances of sequences present in the 2× accessed identifier library as determined by sequencing. Each pixel represents an identifier and the greyscale intensity at that pixel represents the relative abundance of that identifier compared to other identifiers in the row. Maximum greyscale (dark) intensity is set at the average copy number for identifiers in each row. Identifiers may be interpreted to represent bit values of ‘1’ if they are within the top 3 most abundant identifiers in a contiguous string of 13 identifiers (along the rows of the map). All others are interpreted to represent bit values of ‘0’. Applying this decoding processing step to the data results in zero identifier errors and zero identifier erasures, and therefore a dataset that exactly matches the encoded message (FIG. 41A). FIG. 45B shows the correlation between identifier copy numbers in the original library versus the 2× accessed identifier library, and FIG. 45C shows the distribution of identifier copy numbers in the original identifier library versus the 2× accessed identifier library. Results indicate that little or no bias may occur during nested identifier access methods.
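The 1/9 and 1/81 fractions quoted for the single and nested ‘AND’ accesses can be illustrated with a toy model in Python. The model is an assumption for illustration: it represents an identifier as a tuple with one component per layer, uses a hypothetical layer count of 4, and fills the library with every possible combination (a real library contains only the true identifiers, so the real fractions are approximate). Only the figure of 3 possible components per accessed layer comes from the text.

```python
from fractions import Fraction
from itertools import product

NUM_LAYERS = 4            # hypothetical; chosen only to keep the toy small
COMPONENTS_PER_LAYER = 3  # from the text: each accessed layer had 3 components

# Toy library: every combination of one component index per layer.
library = list(product(range(COMPONENTS_PER_LAYER), repeat=NUM_LAYERS))


def and_access(identifiers, left_layer, left_component,
               right_layer, right_component):
    """Keep identifiers that carry BOTH primer-targeted components.

    This mimics PCR, where a molecule is amplified exponentially only if
    both the forward and the reverse primer find their binding site.
    """
    return [ident for ident in identifiers
            if ident[left_layer] == left_component
            and ident[right_layer] == right_component]


# First access: primers on the two edge layers.
accessed = and_access(library, 0, 1, NUM_LAYERS - 1, 2)
# Nested access: primers on the layers one removed from each edge.
twice_accessed = and_access(accessed, 1, 0, NUM_LAYERS - 2, 1)

print(Fraction(len(accessed), len(library)))        # 1/9
print(Fraction(len(twice_accessed), len(library)))  # 1/81
```

Each ‘AND’ fixes one component on each of two layers, so each access narrows the pool by a factor of 3 × 3 = 9.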

FIGS. 46A-C present data obtained after storing the original identifier library representing the message (from FIG. 41) at 100° C. for 4 days. The original identifier library was dried down with a preserving additive (biostabilizing material) and kept in a thermocycler held at 100° C. for 4 days. FIG. 46A shows a heat map (341×351) of the abundances of sequences present in the stored identifier library as determined by sequencing. Each pixel represents an identifier and the greyscale intensity at that pixel represents the relative abundance of that identifier compared to other identifiers in the row. Maximum greyscale (dark) intensity is set at the average copy number for identifiers in each row. Identifiers may be interpreted to represent bit values of ‘1’ if they are within the top 3 most abundant identifiers in a contiguous string of 13 identifiers (along the rows of the map). All others are interpreted to represent bit values of ‘0’. Applying this decoding processing step to the data results in zero identifier errors and zero identifier erasures, and therefore a map that exactly matches the encoded message (FIG. 41A). FIG. 46B shows the correlation between identifier copy numbers in the original identifier library versus the stored identifier library, and FIG. 46C shows the distribution of identifier copy numbers in the original identifier library versus the stored identifier library. Results indicate that little or no bias may occur during extreme heating of the identifier library for prolonged periods of time. Moreover, double stranded DNA quantitation (with Qubit fluorometric quantitation) yielded similar values between the original identifier library (36.4 ng/mL) and the stored identifier library (41.2 ng/mL), indicating that there may have been little to no loss of DNA during the incubation.

Table 8 presents statistics from writing and reading the identifier libraries representing the message and accessed portions of the message (from FIGS. 41-46). For each library, we report the total number of reads of identifiers that represent bit values of ‘0’ (false identifiers), the total number of reads of identifiers that represent bit values of ‘1’ (true identifiers), the fraction of all identifier reads that came from false identifiers (“identifier error rate”), the total number of codewords, the number of codeword erasures, and the number of codeword errors. The distribution of identifiers in each codeword was modeled as a multinomial distribution in which the false identifiers are identically distributed, the true identifiers are identically distributed, and the probability of reading (sampling) a false identifier is equivalent to the identifier error rate. Using the number of codewords represented in each library, and the number of identifier reads from each codeword as the sample size for each codeword, we used the model to calculate the expected number of codeword erasures and codeword errors. Due to the computational intractability of calculating the probability of a codeword erasure or a codeword error at a large sample size, any sample size of greater than 40 reads was capped at 40. Thus, the expectation values should be considered as upper bounds. Results indicate that the erased codeword in the replicated library (FIG. 43A, FIG. 43B, and FIG. 43C) may have been expected due to intrinsic sampling noise.
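The sampling model described above can be explored with a small Monte Carlo sketch in Python. This is not the calculation used to produce Table 8, which the text describes as a direct probability calculation; simulation is swapped in here because it is short and easy to follow. Two points are assumed interpretations: the identifier error rate is taken as the total probability that a read lands on any of the 10 false identifiers (split evenly among them), and an erasure is taken to mean that the least-read true identifier ties with the most-read false identifier. The names `classify` and `simulate` are hypothetical.

```python
import random

LENGTH, WEIGHT = 13, 3  # codeword length and weight (from the text)


def classify(counts, true_positions):
    """Classify one codeword's read counts as 'ok', 'error', or 'erasure'."""
    true_counts = [counts[i] for i in true_positions]
    false_counts = [c for i, c in enumerate(counts) if i not in true_positions]
    if max(false_counts) > min(true_counts):
        return "error"    # a false identifier has more reads than a true one
    if max(false_counts) == min(true_counts):
        return "erasure"  # assumed rule: the top 3 cannot be separated
    return "ok"


def simulate(error_rate, reads_per_codeword, trials, seed=0):
    """Sample read counts for `trials` codewords and tally the outcomes."""
    rng = random.Random(seed)
    true_positions = (0, 1, 2)  # arbitrary; the model is symmetric
    n_false = LENGTH - WEIGHT
    # Multinomial cell probabilities: true identifiers share (1 - error_rate),
    # false identifiers share error_rate.
    weights = [(1 - error_rate) / WEIGHT if i in true_positions
               else error_rate / n_false
               for i in range(LENGTH)]
    tally = {"ok": 0, "error": 0, "erasure": 0}
    for _ in range(trials):
        counts = [0] * LENGTH
        for i in rng.choices(range(LENGTH), weights=weights,
                             k=reads_per_codeword):
            counts[i] += 1
        tally[classify(counts, true_positions)] += 1
    return tally


# Example: the identifier error rate reported for the replicated library,
# with a made-up, deliberately small sample size of 10 reads per codeword.
print(simulate(error_rate=0.00174, reads_per_codeword=10, trials=10000))
```

At small sample sizes most failures in this model are erasures caused by a true identifier receiving zero reads, which matches the kind of erasure the text reports for the replicated library.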

TABLE 8. Statistics from writing and reading the identifier libraries

| Identifier library | Original | Repeated | Replicated | Accessed | 2× accessed | Stored |
|---|---|---|---|---|---|---|
| From Figure | FIG. 41 | FIG. 42 | FIG. 43 | FIG. 44 | FIG. 45 | FIG. 46 |
| True identifier reads | 1879590 | 1815322 | 641682 | 104474 | 94301 | 4327130 |
| False identifier reads | 3494 | 940 | 1117 | 221 | 205 | 8588 |
| Identifier error rate | 0.00186 | 0.00052 | 0.00174 | 0.00211 | 0.00217 | 0.00198 |
| Total codewords | 9181 | 9181 | 9181 | 1323 | 162 | 9181 |
| Codeword erasures | 0 | 0 | 1 | 0 | 0 | 0 |
| Codeword errors | 0 | 0 | 0 | 0 | 0 | 0 |
| Expected number of codeword erasures (upper bound) | 0.00812 | 0.02793 | 1.19021 | 0.09196 | 0.00014 | 0.00788 |
| Expected number of codeword errors (upper bound) | 0.00031 | 0.00099 | 0.03322 | 0.00318 | 0.00001 | 0.00030 |
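The identifier error rates in Table 8 are consistent with dividing the false identifier reads by the total identifier reads (true plus false) for each library. The short Python check below reproduces the tabulated values from the read counts; the dictionary and function names are hypothetical.

```python
# (true identifier reads, false identifier reads) per library, from Table 8.
reads = {
    "Original":    (1879590, 3494),
    "Repeated":    (1815322, 940),
    "Replicated":  (641682, 1117),
    "Accessed":    (104474, 221),
    "2x accessed": (94301, 205),
    "Stored":      (4327130, 8588),
}


def identifier_error_rate(true_reads, false_reads):
    """Fraction of all identifier reads that came from false identifiers."""
    return false_reads / (true_reads + false_reads)


rates = {name: f"{identifier_error_rate(t, f):.5f}"
         for name, (t, f) in reads.items()}
for name, rate in rates.items():
    print(name, rate)
# Original 0.00186
# Repeated 0.00052
# Replicated 0.00174
# Accessed 0.00211
# 2x accessed 0.00217
# Stored 0.00198
```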

Example 13: A Study of the Stability of DNA

FIGS. 47A-D present data for DNA samples incubated for 8 days at 4 different temperatures. Multiple samples, each of approximately 250 ng of ~450 base DNA (the target), were dried with preserving additive (BM represents biostabilizing material) and heated at 75.1° C. (FIG. 47A), 84.4° C. (FIG. 47B), 90.2° C. (FIG. 47C), or 95.0° C. (FIG. 47D) for 8 days. At different time points over the 8 days, samples were removed and stored at room temperature until final measurement at the end of the 8 days. At the final measurement, the relative amount of target DNA in each sample was quantified with qPCR. Quantitation values are normalized to the zero timepoint samples that were not heated. Results indicate that minimal DNA degradation may take place, even with prolonged incubation at high temperatures.

Example 14: A Study of the Effect of Glycerol on Ligation

FIG. 48 presents data from ligation of 4 sticky-ended DNA components (with 6-base, 3′ overhangs) in the presence of various amounts of glycerol (in terms of percent volume-per-volume). Ligation was performed with 0.125 μM each DNA component and 5 CEU/μL T4 Ligase (100 CEU overall) at 25° C. Ligation time was 2.5 minutes. Efficiency was measured using qPCR relative to a full length control representing the fully ligated product. Results indicate that adding up to 20% or more glycerol may not affect ligation, but that adding 40% or more may be inhibitory.

While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It is not intended that the invention be limited by the specific examples provided within the specification. While the invention has been described with reference to the aforementioned specification, the descriptions and illustrations of the embodiments herein are not meant to be construed in a limiting sense. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. Furthermore, it shall be understood that all aspects of the invention are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is therefore contemplated that the invention shall also cover any such alternatives, modifications, variations or equivalents. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

1-63. (canceled)
64. A method for writing information into a nucleic acid sequence, comprising: (a) providing a device comprising a plurality of partitions; (b) determining a string of symbols to represent said information; (c) constructing, on at least one partition, a plurality of components, wherein each individual component of said plurality of components is a nucleic acid molecule having a nucleic acid sequence; (d) altering the temperature in the at least one partition; and (e) chemically linking together two or more components of said plurality of components, thereby generating one or more identifiers of an identifier library, wherein each identifier of said one or more identifiers comprises two or more components.
65. The method of claim 64, wherein an individual identifier of said one or more identifiers corresponds to an individual symbol in said string of symbols.
66. The method of claim 65, wherein one symbol value at each position of said string of symbols is represented by the absence of a distinct identifier in the identifier library.
67. The method of claim 64, wherein said two or more components are assembled in a fixed order.
68. The method of claim 64, wherein said two or more components are assembled with one or more partitioning components disposed between two components from different layers of said two or more layers.
69. The method of claim 64, wherein said plurality of nucleic acid sequences stores metadata of said information or conceals said information.
70. The method of claim 64, further comprising: combining two or more identifier libraries; and tagging each identifier library of said two or more identifier libraries with a distinct barcode.
71. The method of claim 64, wherein each individual identifier in said identifier library comprises a distinct barcode.
72. The method of claim 71, wherein said distinct barcode of each individual identifier has a minimum Hamming distance from the distinct barcode of other individual identifiers.
73. The method of claim 64, wherein chemically linking comprises hybridizing two single-stranded nucleic acid molecules.
74. The method of claim 73, wherein hybridizing comprises annealing two single-stranded nucleic acid molecules.
75. The method of claim 64, wherein chemically linking comprises ligating together two or more components of said plurality of components using a reagent.
76. The method of claim 75, wherein said reagent is a T4 ligase, a T7 ligase, a T3 ligase, Taq ligase, Chlorella virus ligase, Thermococcus sp. strain 9° N ligase, or an E. coli ligase.
77. The method of claim 76, further comprising inactivating said ligases using a buffer containing EDTA or guanidine thiocyanate.
78. The method of claim 75, wherein said reagent comprises polyethylene glycol (PEG), dimethyl sulfoxide (DMSO), 1,2-Propanediol (1,2-Prd), glycerol, or polysorbate 20.
79. The method of claim 75, wherein said reagent further comprises glycerol molecules.
80. The method of claim 64, wherein chemically linking in (e) comprises using overlap-extension polymerase chain reaction (PCR).
81. The method of claim 64, further comprising dehydrating said identifier library by dehydrating each individual identifier of at least said subset of said one or more identifiers.
82. The method of claim 64, further comprising amplifying said one or more identifiers with PCR or linear amplification.
83. The method of claim 82, wherein said PCR has at least 10 cycles.